ABSTRACT


As computing machines advance, new fields are explored and old ones are ex- 
panded. This thesis considers parallel solutions to several well-known problems 
from numerical linear algebra, including Gauss Factorization and the method of 
Conjugate Gradients. The Gauss algorithm was implemented on two parallel ma- 
chines: an Intel iPSC/2, and a network of INMOS T-800 transputers. Interprocessor 
communication—in both cases—was borne by a hypercube interconnection topology. 

The results reveal general findings from parallel computing and more specific 
data and information concerning the systems and algorithms that were employed. 
Communication is timed and the results are analyzed, showing typical features of 
a message passing system. System performance is illustrated by results from the 
Gauss codes. The use of two different pivoting strategies shows the potential and 
the limitations of a parallel machine. The iPSC/2 and transputer systems both 
show excellent parallel performance when solving large, dense, unstructured systems. 
Differences, advantages, and disadvantages of these two systems are examined and
expectations for current and future machines are discussed.
THESIS DISCLAIMER


The computer programs developed in this research have not been exercised for 
all cases of interest. Every reasonable effort has been made to eliminate computa- 
tional and logical errors, but the programs should not be considered fully verified. 
Any application of these programs without additional verification is at the user’s 
risk. A reasonable effort has been put forth to make the code efficient. Optimization 
has been suppressed, however, in areas where it would jeopardize the simplicity and 
clarity of the algorithm without great reward in terms of performance. 

IMS, inmos, and occam are trademarks of INMOS Limited, a member of the 
SGS-THOMSON Microelectronics Group. INTEL, intel, and iPSC are trademarks 
of Intel Corporation. IBM, PC AT, and PC XT are registered trademarks of Inter- 
national Business Machines Corporation. CIO, LD-ONE, LD-NET, TASM, TCX, 
TIO, TLIB, and TLNK are trademarks of Logical Systems. MS-DOS is a trademark 
of Microsoft Corporation. MATLAB is a trademark of The MathWorks, Inc. All 
other brand and product names are trademarks or registered trademarks of their
respective companies.
I. PREFACE


The need for speed accompanied by reliability has driven many advances in machine 
design. The history of computing is replete with examples—many from scientific 
fields—where necessity became the impetus for faster, more reliable machinery. 
Without exception, history and past designs have played key roles in the invention 
of new equipment. The maturity of mechanical calculator design was foundational 
in the construction of electronic computers. Today’s multiprocessor computers are 
extensions of uniprocessor machines and include technology developed by our tele- 
phone industry. Many well-worn tools and lessons from the past can be applied. 
Many new ideas must be put to the test. This thesis is about applying old principles
and evaluating new tools and equipment.
A. A SURVEY OF COMPUTING MACHINERY 


Nothing is more important than to see the sources of invention, which are,
in my opinion, more interesting than the inventions themselves. 


— GOTTFRIED WILHELM LEIBNIZ (1646-1716) 
1. Beginnings 


The history of mathematics and computing is as old as civilization. Tools 
like the abacus have been used to simplify arithmetic problems. Wilhelm Schickhard 
(1592-1635), Blaise Pascal (1623-1662), and Gottfried Wilhelm Leibniz designed and 
built mechanical, gear-driven calculators. The latest of these was essentially a
four-function calculator. By the mid-1800s, Charles Babbage had designed his Difference
Engine and proceeded to the more advanced Analytical Engine. These machines were
never completed (at least not to the grand scale that Babbage planned), but the basic
design of the Analytical Engine lies at the heart of any modern computer. Consider
his motivation.

The following example was frequently cited by Charles Babbage (1792-1871)
to justify the construction of his first computing machine, the Difference Engine
[Ref. 1]. In 1794 a project was begun by the French government under the direction 
of Baron Gaspard de Prony (1755-1839) to compute entirely by hand an enormous 
set of mathematical tables. Among the tables constructed were the logarithms of 
the natural numbers from 1 to 200,000 calculated to 19 decimal places. Comparable 
tables were constructed for the natural sines and tangents, their logarithms, and the 
logarithms of the ratios of the sines and tangents to their arcs. The entire project 
took about 2 years to complete and employed from 70 to 100 people. The mathemat- 
ical abilities of most of the people involved were limited to addition and subtraction. 
A small group of skilled mathematicians provided them with their instructions. To 
minimize errors, each number was calculated twice by two independent human cal- 
culators and the results were compared. The final set of tables occupied 17 large 
folio volumes (which were never published, however). The table of logarithms of the 
natural numbers alone was estimated to contain about 8 million digits. 

This quote, from Hayes [Ref. 2: p. 1], helps to explain why computers 
exist and shows some of the incentive for making them better. Computing ma- 
chinery is designed for speed and reliability. A computer’s “performance” should 
be measured against both of these components. Speed normally receives the most 
attention. Reliability, by whatever label you choose to give it, rarely receives due 
(and/or timely) attention. Too often errors and issues of correctness receive careful 
consideration in reactive—not proactive—situations. Kahan says, “The Fast drives 
out the Slow even if the Fast is wrong” [Ref. 3: p. 596]. 

The correctness side of performance is a much tougher game, and reliability
can be a fairly subjective matter. Often we pursue solutions that are “good enough”
(and this cannot always be defined). Time, on the other hand, has well-defined units
and the standards for measuring time enjoy a history as old as the first sunrise. The
ease with which the programmer can access the machine’s clock makes measurements
of this side of performance somewhat easier.
Figure 1.1: Technologies and Computing Speed


Industry demands fast machines because “time is money” and speed alone 
can make difficult, time-consuming problems tolerable. Without doubt, the speed 
of a processor and execution time are important performance considerations. But 
speed is partly dependent upon technology. Babbage’s designs represented quite an 
advance, but they could not be realized in his day. Technology can determine which 
designs succeed, and to what extent. Figure 1.1 compares several recent technologies
using speed (measured in operations per second) as the yardstick. The data for this
illustration was taken from Hayes [Ref. 2: p. 9]. As the figure indicates, nearly
a century passed after Babbage’s work before major technological advances came about.
2. Electricity 


Significant gains in speed were made possible when electricity could be used 
in computer engineering. The United States census of 1890 employed punched cards 
that were read using electricity and light. Herman Hollerith (1860-1929), the de- 
signer of these cards, formed a company that would later join others and (in 1924) 
take on the name International Business Machines Corporation. Punched paper tape 
was later used by IBM in the Harvard Mark I, a general-purpose electromechani- 
cal computer designed by Howard Aiken (1900-1973). In the late 1930s, at Iowa 
State University, John V. Atanasoff was creating a special-purpose machine to solve 
systems of linear equations. He is credited with “the first attempt to construct an 
electronic computer using vacuum tubes” [Ref. 2: p. 16]. 

In 1943, at the University of Pennsylvania, J. Presper Eckert and John W. Mauchly
began directing the creation of “the first widely known general-purpose electronic
computer”. The Electronic Numerical Integrator and Computer (ENIAC) project was
funded by the U.S. Army Ordnance Department. The 30-ton machine was completed
in 1946. It held more than 18,000 vacuum tubes. It could perform a ten-digit
multiplication in three milliseconds, three orders of magnitude faster than the
Harvard Mark I. [Ref. 2: pp. 17-18]
3. First Generation Computers 


From Babbage’s Analytical Engine to ENIAC, computer architectures held 
data and programs in separate memories. In 1945, John von Neumann (1903-1957)
proposed the stored-program concept (i.e., programs and data could be stored in
the same memory unit). The Hungarian-born mathematician’s involvement in the
ENIAC project is not remembered by many, but the “von Neumann architecture”
has become commonplace. In fact, it “has become synonymous with any computer 
of conventional design independent of its date of introduction” [Ref. 2: p. 31]. 
Hennessy and Patterson [Ref. 3: pp. 23-24] object to the widespread use of this 
term, claiming that Eckert and Mauchly deserved more of the credit. 

In 1946, von Neumann (and others) began to design such an architecture 
at the Institute for Advanced Study (IAS), Princeton. This machine, now called
the IAS computer, is representative of so-called first-generation computers (as Hayes 
points out: “a somewhat short-sighted view of computer history”). The IAS machine 
was roughly ten times faster than ENIAC [Ref. 3: p. 24]. During the 1946-1948 
timeframe, A. W. Burks, H. H. Goldstine, and John von Neumann wrote a series of 
reports describing the IAS design and programming. The advances and refinements 
in computer design that came out of this period were important and lasting. By 
1950, von Neumann and his colleagues had formed a foundation of theory and design
worthy of advanced technology. [Ref. 2: pp. 19-20]
4. Transistors 


The change from vacuum tube to transistor technology marked the beginning
of the “second generation” of computers (approximately 1955-1964). Transistor
technology provided faster switching elements, but this was not the only change
of the decade. Many of the plans of the late forties and early fifties involved memory, 
so it was fitting that ferrite cores and magnetic drums be used for faster main mem- 
ories. Changes such as these led Hennessy and Patterson to conclude that “cheaper 
computers” were the principal new product of the early 1960s [Ref. 3: p. 26]. 

Additionally, machines became more sophisticated. The space and
tasks of the central processing unit (CPU) and main memories were decentralized
with the advent of special-purpose processors to augment the CPU and special-purpose
memories (e.g., registers) to augment the main memory. Finally, system
software was becoming a greater issue. Programming continued moving upward,
away from the machine level, and the processing of batch jobs was becoming more
automated. [Ref. 2: pp. 31-32]
5. Integrated Circuits 


The first integrated circuit (IC) was introduced in 1961 [Ref. 4: p. 1], and
the use of ICs would be among the most significant advances evident in third-generation
computers (starting about 1965). Integrated circuits brought major
changes in cost, maintenance, reliability, and the amount of real estate required.
Other than these hardware improvements (circuits and memory), third-generation
computing was not easy to distinguish from that of the second generation. There was
some migration from hardware to software (e.g., microprogramming), more specialized
and compartmentalized CPUs (e.g., pipelining), and system software continued
to advance (e.g., operating systems that could support multiprogramming through
“time-slicing”). [Ref. 2: p. 40]
6. Instruction Set Trade-Offs


A large part of designing computer hardware and software involves analysis 
of cost-performance ratios. Other than genuine advances in design or technology, 
almost every aspect of computer architecture involves trade-offs. There is usually 
a spectrum of options from which the computer architect chooses, and the “best” 
solutions are not always found near the ends of the spectrum. Performance can rarely 
be optimized with respect to both space and time, so a balance must be sought. This
space-time conflict and others appear when a designer must select a sophisticated
instruction set, or a very simple one, or one of the many options along the spectrum
between these extremes.


In the late 1970s and early 1980s both hardware and software became pro- 
gressively more sophisticated. Instructions became longer and more complex. The 
Complex Instruction Set Computer (CISC) was popular. This design has the advan- 
tage of powerful instructions, but the machine must decode each instruction (it is 
a binary code). The decoding process favors brevity because longer instructions re- 
quire more levels of decoding circuitry. Nonetheless, if the longer instructions could 
carry enough meaning, the decoding endeavor would be justified. 

IBM researchers uncovered a provocative statistic—20% of the instruction 
set was carrying 80% of the burden [Ref. 5: p. 5]. The instruction set had become 
too complex. With some help from several researchers and IBM, the Reduced In- 
struction Set Computer (RISC) architecture became popular. RISC machines admit 
a smaller vocabulary, but claim quicker comprehension. In fact, the goal of the RISC 
architectures is one-cycle execution of the instructions [Ref. 5: pp. 6-7]. Hennessy 
and Patterson, both key contributors to the RISC movement, give an indication of 
the current broad acceptance of the RISC architecture [Ref. 3: p. 190]:

Prior to the RISC architecture movement, the major trend had been highly 
microcoded architectures aimed at reducing the semantic gap. DEC, with the VAX, 
and Intel, with the iAPX 432, were among the leaders in this approach. In 1989,
DEC and Intel both announced RISC products—the DECstation 3100 (based on the 
MIPS Computer Systems R2000) and the Intel i860, a new RISC microprocessor. 
With these announcements, RISC technology has achieved very broad acceptance.
In 1990 it is hard to find a computer company without a RISC product either
shipping or in active development.


Three major research projects were central to early RISC developments. The first— 
the IBM 801—began in the late 1970s, under the direction of John Cocke. In 1980, 
David Patterson and his colleagues at the University of California at Berkeley began 
the RISC-I and RISC-II projects for which the architecture is named. Finally, John
Hennessy and others at Stanford University “published a description of the MIPS
machine” in 1981. [Ref. 3: p. 189]
7. Multiprocessors and Multicomputers


The most recent advances in the design of computing machinery include 
parallel and concurrent architectures. The terminology associated with these ma- 
chines has been developing for about twenty-five years, but it is still immature. 
The terms “multiprocessor” and “multicomputer”, for instance, are sometimes used 
with additional meaning. C. Gordon Bell proposes that an MIMD (multiple-instruction, multiple-data) machine with
message passing and no shared memory be called a multicomputer. He calls a 
shared-memory MIMD machine a multiprocessor [Ref. 6: p. 1092]. This termi- 
nology seems to be on the way to acceptance, and it seems useful in giving a general 
characterization to many systems, but it lacks the sort of precision that may be 
necessary. 

First, the word “computer” usually carries many expectations with it. From 
a computer, we expect things like input and output facilities, peripheral devices, and 
so on. These are things that a node on a typical “multicomputer” does not always 
possess. A “processor” is just the opposite. It might be just about any sort of 
processor and we are cautious about attaching any expectations to the term. Many 
processors are special-purpose machines, but (more substantial) central processing 
units and arithmetic logic units are also numbered among processors. The terms 
“computer” and “processor” are not precise. 

Second, by automatically associating Flynn’s taxonomy, memory models
(e.g., shared, distributed), and other things with a terminology, we reduce their
importance and hide them behind the term. By using the term “multicomputer”
without careful definition up front, we run the risk of forgetting that we are talking
about an MIMD machine that uses message passing and has no shared memory.
Additionally, this terminology, packed with expectations, ignores an entire spectrum
of very real possibilities. Are we saying that a machine cannot employ a combination
of shared and distributed memory? Using this terminology, how would we say that
the memory available to each node of a given system was 30 percent shared and 70
percent local (distributed)?

Nevertheless, the terms have some use, provided we don’t expect too much 
of them. After all, we distinguish cars from trucks in everyday conversation with 
reasonably little confusion. But—in the same way that it is not prudent to assume 
that “car” implies a vehicle equipped with a V-8 engine and four doors—we should 
be careful to guard against packing too many specifics and expectations into the 
terms “multiprocessor” and “multicomputer.” For this reason, the terms multipro- 
cessor and multicomputer are used almost interchangeably in this work. A conscious 
effort is made to support them with a clear description of the memory paradigm, 
communications facilities, and so on. 

Bell’s terminology identifies the systems used in this work (iPSC/2 and 
transputer networks) as multicomputers. Nevertheless, I often use the term “mul- 
tiprocessor” to identify a system with more than one processor (such as the ones 
described in Chapter V and Appendix B). That is, multiprocessor means nothing 
more than the expected combination of “multi” with “processor.” To forestall confusion,
note that the rest of the thesis pertains to distributed memory machines that use
message passing to communicate instructions and data between nodes.
8. Uniprocessors and Multiprocessors 


At the chip level, multiprocessor systems resemble their single-processor
predecessors. Experience (e.g., telephone industry, electronic technology) and a foundation
of theory and design (e.g., von Neumann’s work, network theory) are distinct
benefits in the development of equipment and techniques for distributed and parallel
computing. From a system perspective, though, the concurrent use of more than one
processor creates a fundamentally different environment.


Uniprocessor systems differ substantially from multiprocessors and multi- 
computers in their ability to access data without competition. In the presence of 
more than one processor—regardless of memory model—there is a need to coordinate 
requests for data. This means that the multicomputer must accommodate interpro- 
cessor communications. The nodes of a multiprocessor system must work together 
efficiently to justify the cost of the resulting system. Some parts of the solution are 
relatively mature, but a vast territory (algorithms, electronic components, media
for communication, and software engineering techniques) begs further exploration.


B. CURRENT APPROACHES 


1. Machines 


To compare the capabilities of different machines, some method of benchmarking
is typically used. By timing the execution of one or more programs on a given
machine we can determine its performance for the given problems. By comparing the
execution times for the same problems on different machines, we arrive at a notion
of their relative power. A popular method for sizing up the computing power of
a machine is the LINPACK benchmarking program [Ref. 7]. This is essentially a
program involving the solution of a dense system of linear equations.
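The idea of timing a dense solve and converting the time into an execution rate can be sketched as follows. This is only an illustration under stated assumptions, not the LINPACK code itself: the function names are hypothetical, the solver is a plain Gaussian elimination with partial pivoting written for clarity rather than speed, and the operation count 2n^3/3 + 2n^2 is the figure conventionally credited for an n-by-n solve.

```python
import random
import time


def solve_dense(a, b):
    """Solve a x = b by Gaussian elimination with partial pivoting."""
    n = len(a)
    a = [row[:] for row in a]  # work on copies so the caller's data survives
    b = b[:]
    for k in range(n):
        # Partial pivoting: choose the largest entry in column k as the pivot.
        p = max(range(k, n), key=lambda i: abs(a[i][k]))
        if a[p][k] == 0.0:
            raise ValueError("matrix is singular")
        a[k], a[p] = a[p], a[k]
        b[k], b[p] = b[p], b[k]
        for i in range(k + 1, n):
            m = a[i][k] / a[k][k]
            for j in range(k, n):
                a[i][j] -= m * a[k][j]
            b[i] -= m * b[k]
    # Back substitution on the resulting upper triangular system.
    x = [0.0] * n
    for i in range(n - 1, -1, -1):
        s = sum(a[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (b[i] - s) / a[i][i]
    return x


def linpack_flops(n):
    """Operation count conventionally credited for an n-by-n dense solve."""
    return 2.0 * n**3 / 3.0 + 2.0 * n**2


def benchmark(n, seed=0):
    """Time one solve of size n; return (seconds, flops per second, max error)."""
    rng = random.Random(seed)
    a = [[rng.uniform(-1.0, 1.0) for _ in range(n)] for _ in range(n)]
    b = [sum(row) for row in a]  # chosen so the exact solution is all ones
    start = time.perf_counter()
    x = solve_dense(a, b)
    elapsed = time.perf_counter() - start
    rate = linpack_flops(n) / elapsed if elapsed > 0 else float("inf")
    error = max(abs(xi - 1.0) for xi in x)
    return elapsed, rate, error
```

Calling `benchmark(200)` on two machines and comparing the reported rates gives a rough notion of their relative power for this one problem. Returning the error alongside the time keeps correctness in view while speed is being measured.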

Currently, under this LINPACK test, the fastest machines in the world 
have surpassed the gigaflop mark (a billion floating-point operations per second). 
Table 1.1, adapted from Dongarra’s report [Ref. 8: p. 21], shows performance data. 
The leftmost column of this table gives the name of the system and the clock rate (in
parentheses). The next column contains p, the number of processors used to obtain
the data shown in the four remaining columns. For most systems (e.g., the
Intel iPSC/860) the size of the system (number of processors used for a given run)
can be scaled, so data was reported for several different system sizes.
[Table 1.1: LINPACK performance data, adapted from Ref. 8: p. 21. Columns: system
(clock rate), p, r_max, n_max, n_1/2, r_peak. Systems listed: Intel Delta (40 MHz),
Thinking Machines CM-200 (10 MHz), Intel iPSC/860 (40 MHz), Fujitsu AP1000, and
nCUBE 2 (20 MHz).]

The column labeled $r_{max}$ gives the performance (in gigaflops) for the largest
problem run on the machine. The size of that largest problem is indicated by $n_{max}$,
where $n$ is the dimension of the matrix of coefficients, $A \in \mathbb{R}^{n \times n}$. The $n_{1/2}$ column
gives the problem size that yielded a rate of execution that was half of $r_{max}$. Finally,
$r_{peak}$ denotes the theoretical peak performance (in gigaflops) for the machine.

This data indicates that Intel is the current leader, among companies in
the United States, of the teraflop race, so we shall take a closer look at their products.
The Intel i860 microprocessor, together with 8 megabytes of memory, forms
one of 128 nodes in the hypercube-connected iPSC/860. This machine achieves performances
of nearly two gigaflops with LINPACK. iPSC stands for Intel Personal
SuperComputer, so this entry would not appear to target high-end markets. The
most significant project in supercomputing at Intel today is the Touchstone project.
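For readers unfamiliar with hypercube interconnection, a short sketch may help. It assumes the conventional binary labeling of nodes (a 128-node machine is then a 7-dimensional cube, since 128 = 2^7), and the function names are hypothetical. The labeling used by any particular system is not described here.

```python
def hypercube_neighbors(node, dim):
    """Return the nodes adjacent to `node` in a dim-dimensional hypercube."""
    if not 0 <= node < (1 << dim):
        raise ValueError("node label out of range")
    # Flipping bit k of the label gives the neighbor across dimension k.
    return [node ^ (1 << k) for k in range(dim)]


def hop_distance(a, b):
    """Minimum number of links between nodes a and b (Hamming distance)."""
    return bin(a ^ b).count("1")
```

Under this labeling each of the 128 nodes has exactly 7 neighbors, and no two nodes are more than 7 links apart.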

George E. Brown, chairman of the U.S. House Committee on Science,
Space, and Technology, cut the ribbon around the Intel Touchstone Delta at the 
California Institute of Technology on May 31, 1991 [Ref. 9: p. 96]. The Delta 
is a mesh of 528 nodes. Each node holds an i860 processor and 16 megabytes of 
memory. This machine has reached the 11.9 gigaflop mark with the LINPACK 
benchmark. The closest competitor in the world would appear to be the CM-200 
from Thinking Machines, Inc. This 2,048-node machine benchmarks at 9 gigaflops 
[Ref. 8: p. 21]. The Touchstone program is not over. Intel plans to follow the Delta
with the Touchstone Sigma. Sigma will have at least 2,048 nodes, each consisting of 
the i860 XP processor (about twice as powerful as the i860). [Ref. 9: p. 96] 

The European high-performance computing market favors the transputer,
a microprocessor made by INMOS. The New York Times of May 31, 1991 lists one
German company, Parsytec, and seven American companies that have entered the
teraflop race [Ref. 10]: Bolt, Beranek, and Newman (BBN), Cray Research, IBM, Intel,
nCUBE, Thinking Machines, and Tera Computer. Parsytec expects their GC
to provide “the necessary 2 to 3 orders of magnitude increase in performance above
existing supercomputers to give scientists the tool to attack their Grand Challenges.”
[Ref. 10: p. 1]

Parsytec envisions a system of up to 16,384 processing elements based upon
the INMOS T9000 transputer (see Chapter VII). This would give the Parsytec machine
25-megaflop nodes capable of communications bandwidths near 100 megabytes
per second. The Parsytec design begins with a cluster of seventeen T9000 processors
(sixteen primary processors and the seventeenth for backup) and four C104 wormhole
routing chips. From four clusters, the company will craft a GigaCube (or simply
Cube) of 64 processors (not counting redundant elements in the design). The GC-1
would represent a one gigaflop system and this would be the building block for
greater systems (lesser systems can initially be equipped with 16, 32, or 48 nodes).
The processors in a single (Giga)Cube are arranged in a three-dimensional (4 × 4 × 4)
grid. [Ref. 10]
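The arithmetic behind these configurations can be checked with a few lines. The sketch below uses only the figures quoted above. The names are hypothetical, and the row-major mapping from a node index to grid coordinates is an assumption made for illustration, not a description of Parsytec's numbering. The product of node count and per-node rate is a nominal aggregate and need not match a vendor's quoted system rating.

```python
NODE_MFLOPS = 25        # per-node rate quoted in the text
CLUSTER_PRIMARY = 16    # primary processors per cluster (the 17th is a spare)
CLUSTERS_PER_CUBE = 4   # clusters combined into one GigaCube
GRID = (4, 4, 4)        # three-dimensional arrangement within a cube


def cube_nodes():
    """Primary processors in one GigaCube."""
    return CLUSTER_PRIMARY * CLUSTERS_PER_CUBE


def nominal_gflops(nodes):
    """Nominal aggregate rate in gigaflops: node count times per-node rate."""
    return nodes * NODE_MFLOPS / 1000.0


def grid_coords(node):
    """Map a node index 0..63 to (x, y, z), assuming row-major numbering."""
    nx, ny, nz = GRID
    if not 0 <= node < nx * ny * nz:
        raise ValueError("node index out of range")
    return (node // (ny * nz), (node // nz) % ny, node % nz)
```

With these figures a full cube holds 64 primary processors, and the largest envisioned system of 16,384 processing elements corresponds to 256 such cubes.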
2. Programming Practice 


Software engineering for multiprocessor systems is similar to contemporary
practices for sequential machines. The programming languages used in this work
provide normal C libraries with additional functions to accommodate interprocessor
communications. The systems typically provide a loader designed to load executable
code onto the (host and) nodes according to the programmer’s instructions. Some
loaders require that the same code be loaded onto each of the nodes. Other, more
flexible, loaders allow the user to specify which program should be loaded onto each
node. The Logical Systems C network loader, LD-NET, is such a program. It takes
a Network Information File (NIF), describing the network’s interconnections and
loading instructions, as input and performs the loading process.
C. THE FUTURE


1. Crossroads 


Parallel and distributed computing is in the early years of a very promising 
lifetime. We should give careful consideration to the direction that the field should
take. Lacking years of experience, I will lean on the writings and advice of others
while trying to peer a little way into the future of parallel computing. A regrettable
side effect of this decision is that this section seems to consist primarily of the 
observations and opinions of others. Notwithstanding the many quotations, I believe 
that several important ideas are exposed. 

This business is filled with a combination of old, established ideas and
proven techniques. It also holds new questions and opportunities. Hamming’s advice
[Ref. 11: p. 14] seems most fitting in this situation:


Now I see constantly attempts to force new ideas to old molds. That is frequently
sensible: How can I make sense of what I’m seeing compared to what I did
before? But also one must ask, “Am I seeing something fundamentally new?” That
part many people will not try. You cannot afford to make everything brand new and
not connect anything together with existing ideas, nor can you try to make everything
fit into preconceived categories. Some combination of the two is necessary.

We limped through the transistor revolution and the computer revolution,
which are connected with the bandwidth revolution; they are all connected together...
You have to abandon old ideas when you get an order of magnitude of change....


— RICHARD W. HAMMING 


Developments in scientific computing today make Dr. Hamming’s thoughts
especially timely. The field needs to establish a strategy: a direction that will lead
from its present immaturity to the fulfillment of its potential. Kenneth Wilson
proposes Grand Challenges for computational science that may help to establish this
strategy [Ref. 12].
2. Grand Challenges


Wilson identifies three modes of scientific activity: theoretical, experi- 
mental, and computational. He defines these areas, claiming that—with today’s 
supercomputers—the most recent science (computational) is becoming more signifi- 
cant. So significant, in fact, that “long experience or professional training is required 
to be successful in computational science at the supercomputer level, making it ap- 
propriate to think of computational science as both a separate mode of scientific 
endeavor and new discipline.” [Ref. 12: p. 172] 

Wilson is careful to distinguish computational science from computer sci- 
ence. He defines computer science as the business of addressing “generic intellectual 
challenges of the computer itself” and characterizes computational science as being 
tailored to specific applications areas (with serious training in the application disci- 
pline) [Ref. 12: p. 172]. To advance computational science, Wilson recommends a 
quantitative approach with clear strategies [Ref. 12: p. 173]: 

The major future opportunities for benefits of supercomputers to basic research
should be identified without the existing compromises, but presented as challenges
to be overcome with the many obstacles to success clearly explained. The
compromises and inadequacies of current computations need to be described and
the level of advances required to overcome these inadequacies discussed. Furthermore,
a few key areas with both extreme difficulties and extraordinary rewards for
success should be labelled as the “Grand Challenges of Computational Science”.
Two examples are electronic structure and turbulence. No easy promises of success
in Grand Challenges should be offered. Instead, computational scientists should be
building plans to assault the Grand Challenges, pushing for the major advances
in algorithms, software, and technology that will be required for true progress to
be achieved in these areas. The Grand Challenges should define opportunities to
open up vast new domains of scientific research, domains that are inaccessible to
traditional experimental or theoretical modes of investigation.


Wilson describes a few examples that demonstrate the limitations of experimental
instrumentation and the potential of supercomputers. Weather prediction,
astronomy, materials science, molecular biology, aerodynamics, and quantum field
theory are the six areas that Wilson chooses to make his point. He describes these
areas in reasonable detail and briefly mentions other topics. [Ref. 12: pp. 175-179]


a. Mathematical Background 


Wilson stresses the need for sound design practices and good algorithms
(to see why, consider Table A.1). Additionally, he warns that we should spend less
time in awe of today’s supercomputing power and admit that it is terribly inadequate.
Modeling methods and sound mathematical background also appear in the “needs
improvement” category. Wilson [Ref. 12: p. 180] believes that
Mathematical developments that relate to numerical computation are highly
important. Theorems about numerical errors or sources of error, exact solutions
and expansions, existence and uniqueness proofs and the like, can make a major difference
in establishing the credibility of a numerical computation. All too frequently
there is too little mathematical understanding backing up numerical simulation.


b. Issues of Quality 


Wilson does not consider these to be the only problems facing computational
scientists. He believes that quality is endangered, primarily from two
directions [Ref. 12: pp. 180-181]:


• A tendency to stay on the safe, easy side; not wandering far from the position:
  “our calculation agrees with experiment.”

• The quality of computational programs, measured against practical criteria,
  is lacking. The standards include rounding errors (e.g., catastrophic cancellation),
  overflows, and stability (with respect to input parameters).

c. Languages


Wilson cites a number of reasons for revolutions in computer languages. 
In particular, he believes that “Fortran is in the long-term the most fundamental 
barrier to progress” [Ref. 12: p. 182]. His approach is realistic enough to recognize 
the vast investments of scientific communities in Fortran. The language cannot and 
should not be eliminated in a day. Nevertheless, it has very serious shortcomings. 
Some problems could be overcome by a Fortran preprocessor (the same idea as the C 
preprocessor). Other problems, like lack of support for abstraction and the unnatural 
exclusion of basic mathematical symbols in the language, are not solved as easily. 
[Ref. 12: p. 182]

Wilson does not recommend a simple change of language as the solution, 
but searches for deeper problems. He believes that the entire way that computational 
scientists and programmers think about and plan programs must change as well. 
After reading Wilson’s analysis of language problems, the basic impression that 
prevails is that we have an urgent need for general-purpose practices to replace 


patchwork, hit-or-miss, case-by-case solutions. 
3. Generality 


David Harel is also an advocate of the need for general purpose techniques. 


In the preface to his book [Ref. 13: p. viii] he warns: 


Curiously, there appears to be very little written material devoted to the science
of computing and aimed at the technically oriented general reader as well as
the professional. This fact is doubly curious in view of the abundance of precisely
this kind of literature in most other scientific areas, such as physics, biology,
chemistry and mathematics, not to mention humanities and the arts. There appears to
be an acute need for a technically detailed, expository account of the fundamentals
of computer science; one that suffers as little as possible from the bit/byte or
semicolon syndromes and their derivatives, one that transcends the technological
and linguistic whirlpool of specifics, and one that is useful both to a sophisticated
layperson and to a computer expert. It seems that we have all been too busy with
the revolution to be bothered with satisfying such a need.


This idea is not unique. One of the other major proponents of general-purpose
parallel computing is David May of INMOS. In an invited lecture at the
Transputing 91 conference [Ref. 14], he highlighted features that general-purpose
parallel hardware should deliver. Among the important components of a general
approach, May included the following:


• Scaling. Performance must scale with number of processors. Efficiency is
partly dependent on problem size, but—with adequate problem size—systems 
of a thousand processors should be within technological reach. Each processor 


is expected to achieve 10°-10° flops. 


• Portability. This is almost synonymous with “general purpose.” May empha-
sizes algorithms based upon features common to many machines, and which 
remain valid as technology evolves. He stresses that this general purpose par- 
allel architecture will benefit both the computer designer and the programmer. 
The designer will gain since the market will be somewhat predictable. The 
programmer’s code will work on several machines and hold a strong hope for 


working into future years. 


To achieve these goals, May proposes several guidelines. First, for a message passing 
system using p processors, the nodes must be capable of concurrent computing and 
communication. The interconnection topology must provide scalable throughput 
(linear in p) and bounded delay, probably log(p). Programs, May believes, should be 
written at as high a level as possible and make use of many processes. The algorithm 
should express the maximum possible parallelism. Much of May’s theory is based 


upon the structure of a hypercube interconnection topology (or virtual hypercube). 
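
The logarithmic delay bound follows from the usual binary labeling of hypercube nodes. The following is a minimal sketch, assuming p = 2^d nodes labeled 0 to p-1 with a link between any two labels that differ in exactly one bit. The function names are illustrative only; they are not taken from May's lecture or from the code of this thesis.

```python
def hypercube_neighbors(node, dim):
    """Neighbors of `node` in a dim-dimensional hypercube: flip one label bit."""
    return [node ^ (1 << k) for k in range(dim)]


def hop_distance(a, b):
    """Fewest links between nodes a and b: the Hamming distance of their labels."""
    return bin(a ^ b).count("1")


def diameter(p):
    """Worst-case hop count for p = 2**d nodes, which is d = log2(p)."""
    d = p.bit_length() - 1
    if p < 1 or p != (1 << d):
        raise ValueError("p must be a power of two")
    return d


if __name__ == "__main__":
    print(hypercube_neighbors(5, 3))   # neighbors of node 101 in a 3-cube
    print(hop_distance(0, 7))          # opposite corners of a 3-cube
    print(diameter(1024))              # delay bound for 1,024 nodes
```

Each node has d links, so the number of links grows with p log(p) while the longest path grows only with log(p); this is one way to read the "bounded delay" guideline.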


4. Projections 


Kenneth Wilson argues credibly that parallel computing is here to stay.
His reasoning rests on the observation that mass production and heavy
competition are proven ingredients in keeping the cost of chips low. Rather than
summarize, I will quote his conclusion [Ref. 12: p. 185]:


Today a single processing unit costing millions of dollars can still be cost- 
effective but I don’t think this can last very long, over a period of time (I cannot 
estimate how many years) it seems likely that the maximum price of a cost-effective
processor will plunge to one hundred thousand dollars, to ten thousand dollars, to
???. I cannot estimate the ultimate equilibrium price at which this plunge will stop.

Meanwhile I can find no prospects that single supercomputer processors speeds 
will advance at anything like the pace at which processor costs are being reduced, 
even using Gallium Arsenide or superconducting Josephson junctions. 

The result of this is inevitable—overall advances at the supercomputer level 
have to come through parallelism, namely, big increases in speed have to come from 
the simultaneous use of many processors in parallel.


David May agrees with Wilson, who states that increasingly complex com- 
ponents and faster clock speeds are not likely avenues of advancement. This makes 
parallel processing “technically attractive.” He also agrees that mass production will 
make the most effective use of design and production facilities. His conclusion: “A 
general purpose parallel architecture would allow cheap, standard multiprocessors to 
become pervasive.” [Ref. 14] 

May’s prediction for 1995 includes processors capable of 100 megaflops. 
INMOS believes strongly in the idea of balancing computation and communication, 
and May projects that node throughputs will have reached 500 megabytes per second. 
In 1995’s multiprocessor systems, he envisions teraflop performance. By 2000, May 
projects “scalable general purpose parallel computers will cover the performance 


range up to 10" flops. Specialised parallel computers will extend this to 10'° flops.” 

D. OVERVIEW


This chapter has surveyed the (relatively recent) history of computing, considered
the state of the art, and made a few guesses as to the future. Additionally, it
has introduced numerical and parallel computing. This serves as a backdrop for the
remainder of the thesis. Chapter II expands the background on parallel processing
and numerical methods. The latter provides a lead-in to the specific algorithms and
theory that appear in Chapter III. Chapter IV introduces the parallel design and
methods used in the work. A description of the environment, tools, and equipment
appears in Chapter V. Results and conclusions appear in Chapters VI and VII.

Appendices are provided to keep the chapters concise and focused. The ap- 
pendix material operates on both sides of that focus. Some of the material is de- 
signed to give sufficient background and the rest—code mostly—is provided for more 
in-depth study. The background material may be obvious to some readers and new 
to others. I have assumed that the reader has some knowledge of the background
material. I do not presume that the reader will be familiar with the code. 

To simplify the discussion we must speak the same language. Appendix A 
gives the basic terms and notation used in the rest of the thesis. Next, we discuss 
the machines used to perform the work. While this is the subject of Chapter V, a 
more detailed account is reserved for Appendix B. Appendix C provides a general 
background on interconnection topologies. Emphasis is placed upon the hypercube 
connection scheme. Appendix D describes the process whereby a real-world problem 
is translated into matrix notation. Appendix E gives some information and results for 
communications performance in a hypercube. Finally, Appendix F provides listings 


for most of the code used in the research. 

II. BACKGROUND


Mathematics is the door and key to the sciences.


— ROGER BACON 


Chapter I provided a backdrop, showing the state of scientific computing, es- 
pecially parallel and distributed forms, today. In the present chapter, the scope 
is limited to material and equipment pertaining to this research. The thesis work 
deals with methods of conjugate directions implemented upon two contemporary 
MIMD machines. The goal is to introduce the theory, machines, methods, and a few


peripheral issues that will be helpful as background information. 
A. COMPUTING WITH REAL NUMBERS 


As illustrated in Figure 1.1, the speed of computing machinery has risen swiftly 
since the 1940s. This has often been encouraged by substantial advances in
technology. Today’s multiprocessor machines seem to be maintaining the fast-paced
growth. Additionally—although precision is a less glamorous business than speed— 
the accuracy of machine solutions has become more standard. This section considers 
some of the principal issues of computing with finite approximations of real numbers. 

We have observed that the history of computing shows close ties to science and 
mathematics. As the design and construction of computers becomes a more spe- 
cialized business—mostly performed by electrical and computer engineers—we still 
find that many of the fundamental requirements are related to scientific problems. 
These problems typically involve mathematics, and a significant amount of scientific
computing applies numerical methods that involve real numbers. The trend in
computer (hardware and software) design is toward abstraction, but from time to time
we absolutely must understand and work with the underlying, concrete principles.

1. Finite-Precision


New problems are generated as the speed of computing machinery improves 
with each generation of machines. One question to be considered is, how reliable 
are the machines and the software that runs on them? This is a constant concern 
in computing. Many scientific problems involve continuous phenomena in the real 
world. Accordingly, we would like to represent the real numbers, $\mathbb{R}$, within the
machine. But, lacking infinite storage, this is impossible. There have been several 
more-or-less reasonable ideas and implementations of approximations to the real 
numbers within the limits of computer storage. Of these, the floating-point concept 
of storage and arithmetic enjoys the most widespread use. 

The Institute of Electrical and Electronics Engineers (IEEE) has established
the principal standards for floating-point representations and arithmetic. These
standards make machine arithmetic more predictable. Surprisingly, although they are
implemented in much of today’s computing hardware, the standards are not widely
understood by practitioners. As a result, software and applications are sometimes
written in ignorance of them. The title of David Goldberg’s paper [Ref. 15] speaks
volumes: “What Every Computer Scientist Should Know About Floating-Point
Arithmetic.” Goldberg is also responsible for several other contributions describing
floating-point arithmetic and the IEEE standards. Appendix A of Hennessy and
Patterson’s book on architecture [Ref. 3] is such a contribution. He gives a very
useful description of the IEEE standards and instruction on how to perform
arithmetic operations on machines that adhere to them.

2. IEEE 754


Of the four precisions specified by the IEEE 754-1985 standard, this thesis 
uses the double precision format most often (to approximate real numbers) so it 
will receive the most attention. In the C programming language, these numbers 
correspond to the type double. They are floating-point values stored in eight bytes 
(64 bits). The storage representation is illustrated as three components: one sign bit, 
s; an 11-bit exponent, e; and a 52-bit fraction, f. Figure 2.1 shows an example. We 
say that e is a biased exponent. Both negative and positive exponents are stored using 
a range of positive binary numbers biased about (nearly) the middle. Significand or 
mantissa is the name given to the number (1.f). The fraction is a packed form of 
the significand. This means that the leading one of the significand is implicit. This 
is called a normalized number. [Ref. 16] 

All IEEE floating-point numbers are normalized except for the special
representations when e = 00000000000 = 0 or e = 11111111111 = 2047. With e = 0
the value is either zero or a denormalized (subnormal) number; e = 2047 is reserved
for infinities and NaNs. Only the fraction, f, of a normalized number is stored
[Ref. 3: p. A-14]. Figure 2.1 shows a representation of the floating-point number,
x = 7.0. First, x is shown as it would be defined in a C program. The C address-of
operator, &, is used to indicate the address of x in memory. That is, somewhere
(namely &x) in memory, there are eight contiguous bytes that hold a floating-point
representation of x and (for illustration purposes) we can imagine the IEEE 754
double-precision representation of x as Figure 2.1 indicates.
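
The three fields can also be inspected directly. The following is a minimal sketch in Python (not the num_sys.c program of Appendix F), assuming that the Python float on the host is an IEEE 754 double; the helper names decode_double and rebuild_normalized are illustrative.

```python
import struct


def decode_double(x):
    """Split an IEEE 754 double into (sign, biased exponent, 52-bit fraction)."""
    # Pack big-endian so that the sign bit is the most significant bit of the integer.
    (bits,) = struct.unpack(">Q", struct.pack(">d", x))
    sign = bits >> 63
    exponent = (bits >> 52) & 0x7FF
    fraction = bits & ((1 << 52) - 1)
    return sign, exponent, fraction


def rebuild_normalized(sign, exponent, fraction):
    """Evaluate (-1)^s * 1.f * 2^(e - 1023); valid for normalized numbers only."""
    significand = 1 + fraction / 2**52      # restore the implicit leading one
    return (-1) ** sign * significand * 2.0 ** (exponent - 1023)


if __name__ == "__main__":
    s, e, f = decode_double(7.0)
    print(s, format(e, "011b"), format(f, "052b"))
    print(e, rebuild_normalized(s, e, f))
```

Packing with an explicit byte order sidesteps the question of how the host actually lays out the eight bytes in memory.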

A standard, such as IEEE 754 (and the lesser-known IEEE 854), is not a
panacea for the finite-precision problem but it lends tremendous support to those
who would scientifically deal with the problems of finite-precision arithmetic.
Programs given in the files num_sys.h and num_sys.c (in Appendix F) are of interest
to those who would explore further. The programs can demonstrate that the actual
order and location of bits in memory may not match the representation of
Figure 2.1. This reflects practicalities concerning storage and transmission of bytes at
a very low level in the machine. It is perfectly reasonable (and easier) to use the
common abstraction of Figure 2.1 regardless of machine implementation.


    double x        (stored at address &x)

    s |      e      |                          f
    0 | 10000000001 | 1100000000000000000000000000000000000000000000000000

    s = 0,  e = 1025,  f = .11 (binary)

    Interpretation:

    $$x = (-1)^s \times (1.f)_2 \times 2^{e-1023}
        = (-1)^0 \times 1.11_2 \times 2^{1025-1023}
        = 1.11_2 \times 2^2
        = 111_2
        = 7$$

Figure 2.1: IEEE 754 Representation: Double Precision



B. NUMERICAL ISSUES 
1. The Need 
Consider the problem of determining the area under a bounded function
f(x) over a closed interval [a, b]. Numerical quadrature (integration) rules such as
the Trapezoidal Rule or Simpson’s Rule are used to arrive at an approximating (or
Riemann) sum of many smaller areas within the region. Numerical methods are
often used in this way to approximate the solution to a problem. Doing so is no
trivial matter. To solve a problem (numerically) by anything other than accident,
one must first understand the theory and analytical approach. Next, the problem can
be translated into an algorithm (a plan, usually mathematical in nature, for solving
the problem step by step) which can, in turn, be translated into the sort of language
that a machine understands.
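
As an illustration of the step from analytical idea to algorithm to machine language, here is a minimal sketch of the composite Trapezoidal Rule. The integrand, interval, and subinterval counts are arbitrary choices made for this example and do not come from the thesis.

```python
import math


def trapezoid(f, a, b, n):
    """Composite Trapezoidal Rule for f on [a, b] using n equal subintervals."""
    h = (b - a) / n
    total = 0.5 * (f(a) + f(b))        # end points carry half weight
    for i in range(1, n):
        total += f(a + i * h)          # interior points carry full weight
    return h * total


if __name__ == "__main__":
    exact = 2.0                        # integral of sin(x) over [0, pi]
    for n in (4, 16, 64):
        approx = trapezoid(math.sin, 0.0, math.pi, n)
        print(n, approx, abs(exact - approx))
```

For a smooth integrand the error shrinks roughly in proportion to the square of the subinterval width, so quadrupling n should reduce the error by about a factor of sixteen.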

This is a relatively simple approximation problem compared to the problem
of finding the solution to a system of 500 equations in 500 unknowns. Consider the
(perhaps more realistic) problem of using numerical linear algebra to solve an elliptic
partial differential equation like the one presented in Appendix D. Numerical
concerns abound in problems such as these. Additionally, many problems in numerical
linear algebra have time complexities of $O(n^2)$ or $O(n^3)$ and storage requirements of
$\Theta(n^2)$, so speed is essential. (Appendix A reviews complexity notation such as
big-Oh and big-Theta.)

2. Errors and Blunders 


A clear understanding of the difference between errors and blunders is
important, since recognizing the source of an error is prerequisite to eliminating or
reducing it. The terms are introduced in [Ref. 17: p. 1]:


Blunders result from fallibility, errors from finitude. Blunders will not be 
considered here to any extent. There are fairly obvious ways to guard against them, 
and their effect, when they occur, can be gross, insignificant, or anywhere in be- 
tween. Generally the sources of error other than blunders will leave a limited range 
of uncertainty, and generally this can be reduced, if necessary, by additional labor.
It is important to be able to estimate the extent of the range of uncertainty.


— ALSTON S. HOUSEHOLDER 

3. The Issues


To anticipate—or even troubleshoot—error we must know from whence it 
comes. In [Ref. 17: p. 2], Alston Householder lists the four sources of error that 


were set forth by John von Neumann and Herman Goldstine: 


• Mathematical formulations are seldom exactly descriptive of any real situation,
but only of more or less idealized models. Perfect gases and material points do
not exist.

• Most mathematical formulations contain parameters, such as lengths, times,
masses, temperatures, etc., whose values can be had only from measurement.
Such measurements may be accurate to within 1, 0.1, or 0.01 percent, or better,
but however small the limit of error, it is not zero.

• Many mathematical equations have solutions that can be constructed only in
the sense that an infinite process can be described whose limit is the solution
in question. By definition the infinite process cannot be completed. So one
must stop with some term in the sequence, accepting this as the adequate
approximation to the required solution. This results in a type of error called
the truncation error.

• The decimal representation of a number is made by writing a sequence of digits
to the left, and one to the right, of an origin which is marked by a decimal
point. The digits to the left of the decimal point are finite in number and
are understood to represent coefficients of decreasing powers of 10. In digital
computation only a finite number of these digits can be taken account of. The
error due to dropping the others is called the round-off error. ...
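
The last two sources of error in the list above can be observed on any machine. The following is a small sketch, assuming IEEE 754 double precision for Python floats; the series for e and the value 0.1 are examples chosen here, not cases taken from Householder.

```python
import math
from fractions import Fraction


def exp_partial_sum(x, terms):
    """Sum the first `terms` terms of the Taylor series for e**x."""
    total, term = 0.0, 1.0
    for k in range(terms):
        total += term
        term *= x / (k + 1)            # next term: x**(k+1) / (k+1)!
    return total


if __name__ == "__main__":
    # Truncation error: the infinite series is stopped after finitely many terms.
    for terms in (2, 5, 10):
        print(terms, abs(math.e - exp_partial_sum(1.0, terms)))

    # Round-off error: 0.1 has no finite binary expansion, so the stored
    # double is only the nearest representable value.
    stored = Fraction(0.1)             # exact rational value of that double
    print(float(abs(stored - Fraction(1, 10))))
    print(0.1 + 0.2 == 0.3)
```

Taking more terms reduces the truncation error, while the round-off error is fixed by the number of bits available in the fraction.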

C. MACHINE METHODS


We would like to somehow characterize the techniques that make a problem- 
solving method “good”. The abilities of machines and people are distinct enough that 
we should not always expect an algorithm for machine solution to mirror the pencil- 
and-paper method of an individual. Hestenes and Stiefel make this distinction, defin- 
ing a hand method as “one in which a desk calculator may be used” and a machine 
method as “one in which sequence-controlled machines are used.” [Ref. 18: p. 409] 
Further, in the same reference, they list the following characteristics that a good 


machine method exhibits: 


(1) The method should be simple, composed of a repetition of elementary 
routines requiring a minimum of storage space. 


(2) The method should insure rapid convergence if the number of steps re-
quired for the solution is infinite. A method which—if no rounding-off errors
occur—will yield the solution in a finite number of steps is to be preferred. 


(3) The procedure should be stable with respect to rounding-off errors. If 
needed, a subroutine should be available to insure this stability. It should be possible 
to diminish rounding-off errors by a repetition of the same routine, starting with 
the previous result as the new estimate of the solution. 


(4) Each step should give information about the solution and should yield a 
new and better estimate than the previous one. 


(5) As many of the original data as possible should be used during each step 
of the routine. Special properties of the given linear system—such as having many 
vanishing coefficients—should be preserved. (For example, in the Gauss elimination
special properties of this type may be destroyed.) 


D. CONJUGATE DIRECTIONS 


Hestenes and Stiefel describe the method of conjugate directions (CD). This is
a general approach to solving systems of linear equations that uses direction vectors,
$p_0, p_1, \ldots$, to determine how the search for a solution should proceed from
step to step. When the method for determining these vectors is defined, CD becomes a
specific method. There are at least two of these specific methods within CD that
are especially suited to computer implementation: Gauss factorization (GF) and the
method of conjugate gradients (CG). [Ref. 18: p. 412]

The term conjugate is clearly an important one for these methods. Given a
matrix $A \in \mathbb{R}^{n \times n}$ that is symmetric, we say that two vectors $x$ and $y$ are conjugate
if

$$x^T A y = (Ax)^T y = 0. \qquad (2.1)$$

There is an alternative term that emphasizes the role of A in this definition. We also
say that $x$ and $y$ are A-orthogonal. [Ref. 18: p. 410]
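
A small numerical check may make the definition concrete. The following sketch uses a 2 × 2 symmetric matrix and two vectors chosen here purely for illustration; the helper names are not from the thesis code.

```python
def mat_vec(A, x):
    """Matrix-vector product Ax for a matrix stored as a list of rows."""
    return [sum(a * xi for a, xi in zip(row, x)) for row in A]


def dot(u, v):
    """Ordinary inner product of two vectors."""
    return sum(ui * vi for ui, vi in zip(u, v))


def a_inner(x, A, y):
    """The quantity x^T A y used in the definition of conjugacy."""
    return dot(x, mat_vec(A, y))


if __name__ == "__main__":
    A = [[2.0, 1.0],
         [1.0, 2.0]]                   # symmetric
    x = [1.0, 0.0]
    y = [1.0, -2.0]
    print(a_inner(x, A, y))            # x^T A y
    print(dot(mat_vec(A, x), y))       # (Ax)^T y, equal because A is symmetric
    print(dot(x, y))                   # ordinary inner product of x and y
```

Here x and y are conjugate with respect to A even though their ordinary inner product is not zero, which is why the term A-orthogonal is the more descriptive one.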

The method of conjugate gradients chooses its direction vectors, $p_i$, to be mutually
conjugate ($p_i^T A p_j = 0$ whenever $i \neq j$) and in such a manner that $p_{i+1}$ depends
upon $p_i$. (A specific formula is given near the end of Chapter III.) The Gauss
factorization chooses $p_i = e_i$, the $i$th axis vector. [Ref. 18: pp. 412, 425-427]

In this research, the Gauss method gets almost all of the attention, but the 
method of conjugate gradients receives a short overview near the end of Chapter III. 
The theory of conjugate directions is not at all trivial, and the ties of Gauss and 
conjugate gradients to conjugate directions are fairly deep. These issues are covered 
in the work of Hestenes and Stiefel [Ref. 18]. This thesis develops the Gauss method 


from an implementation standpoint. 
E. PARALLEL PROCESSING 


The field of parallel and distributed computing is a relatively new one. In 
one sense, it is quite natural. We perform work in parallel every day. In fact, a 
manager-worker notion is a very useful means to understand the issues of this field. 
The programs developed in this research involve a host or manager and nodes or 


workers. This is often called the workfarm approach. 
The principal “problem” in parallel computing is communication. Appendix C
relates some of the considerations. Of course, there are other concerns as well: load
balancing, problem size (granularity), and so on. These issues, as they apply to
this research, are discussed in Chapter IV.

The bottom line—after all of the design and implementation work—is perfor- 
mance. With multicomputers, as in a workfarm, we are after efficiency so that more 
computing can be done in a shorter time and for less money. Bell is even more 
specific. He believes the multicomputer must offer two key facilities to become es- 


tablished [Ref. 6: p. 1097]: 


• Power that is not otherwise available.

• Performance for a price that is “at least an order of magnitude cheaper than
traditional supercomputers.”


In Chapter VI, we consider results obtained upon two contemporary parallel 
machines. This information helps us to evaluate the potential of MIMD architectures 


in terms of Bell’s criteria. 


F. SPEEDUP


The terms speedup and efficiency, defined in Appendix A, capture most of the 
interest when we talk about the potential of parallel computing. The principal reason 
for choosing a multicomputer over a single computer is speed. Therefore, we are most 
interested in knowing what kind of speed we can obtain from a multiprocessor system. 


Bell's comments on price are germane as well. 


Speedup and efficiency are both machine dependent and problem dependent.
Some problems should not be executed on a parallel machine! Suppose, for instance,
that part of a problem must be performed sequentially. Amdahl’s law is a well-known
attempt to characterize this problem. Amdahl stated that speedup on P processors,
S, is limited in the following manner:

$$S \le \frac{1}{f + (1 - f)/P} \qquad (2.2)$$

where f is “the fraction of operations in a computation that must be performed
sequentially, where 0 < f < 1” [Ref. 19: p. 19]. With speedup, S, taken at the
bound in (2.2), we see that

$$\lim_{P \to \infty} S = \frac{1}{f}. \qquad (2.3)$$

Figure 2.2 shows how this limit begins to take effect as the number of processors,
P, is increased from one to 500. The figure is based on Amdahl’s law (2.2) with
sequential percentages, f, of 5%, 10%, and 25%.
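
The bound is easy to tabulate. The following minimal sketch evaluates the right-hand side of (2.2) for the three sequential fractions named above; the function name amdahl_speedup and the sample processor counts are choices made for this illustration.

```python
def amdahl_speedup(f, P):
    """Upper bound on speedup from Amdahl's law (2.2): 1 / (f + (1 - f) / P)."""
    if not 0.0 < f <= 1.0:
        raise ValueError("f must satisfy 0 < f <= 1")
    if P < 1:
        raise ValueError("P must be at least 1")
    return 1.0 / (f + (1.0 - f) / P)


if __name__ == "__main__":
    for f in (0.05, 0.10, 0.25):
        row = [round(amdahl_speedup(f, P), 2) for P in (1, 10, 100, 500)]
        print(f, row, "limit:", 1.0 / f)
```

Even with 500 processors the bound stays below 1/f, which is the limit given in (2.3).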

We can see that Amdahl’s law has some very discouraging news for so-called
massively parallel computing. The massive part of the term is loosely defined,
apparently meaning “many” processors. But Amdahl’s law may be based upon a faulty
assumption [Ref. 20]. Consider the following argument concerning time. Let P be the
number of processors. Let s be the time required to execute the serial portions of a
program on a serial processor and let p be the amount of time required to complete
the parallel work on the same serial processor. Using this notation, and normalizing
(s + p = 1), Amdahl’s law can be restated

$$S = \frac{s + p}{s + p/P} = \frac{1}{s + p/P}. \qquad (2.4)$$

Then, if we consider the case P = 1,024 with s < 10%, we see in Figure 2.3 that
speedup is severely restricted.

Figure 2.2: Amdahl’s Law ($1 \le P \le 500$)

Figure 2.3: Amdahl’s Law (P = 1024)


G. SCALED SPEEDUP 


These problems with the usual notion of speedup led Gustafson, Montry, and
Benner to question the validity of Amdahl’s assumptions [Ref. 20: p. 3]:

The expression and graph are based on the implicit assumption that p is
independent of P. However, one does not generally take a fixed size problem and
run it on various numbers of processors; in practice, a scientific computing problem
scales with the available processing power. The fixed quantity is not the problem
size but rather the amount of time a user is willing to wait for an answer; when
given more computing power, the user expands the problem (more spatial variables,
for example) to use the available hardware resources.

As a first approximation, we have found that it is the parallel part of a program
that scales with the problem size. Times for program loading, serial bottlenecks,
and I/O that make up the s component of the application do not scale with
the problem size. When we double the number of processors, we double the number
of spatial variables in a physical simulation. As a first approximation, the amount
of work that can be done in parallel varies linearly with the number of processors.


Based upon this analysis, they present the notion of scaled speedup. They let
$s'$ and $p'$ represent the serial and parallel time spent on a parallel system (the
inverse of Amdahl’s method), so that $s' + p' = 1$ and a uniprocessor requires time
$s' + p'P$ to perform the task. With these definitions, they define scaled speedup,
$S'$, to be

$$S' = \frac{s' + p'P}{s' + p'} = s' + p'P = P + (1 - P)s'. \qquad (2.5)$$

If we consider the same range of serial fractions as we did in Figure 2.3, we see that
scaled speedup is much better than the usual speedup. Figure 2.4 shows the plot of
scaled speedup.
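
The contrast between the two notions of speedup can be sketched numerically. The code below evaluates the scaled speedup of (2.5) beside the fixed-size bound 1/(s + (1 - s)/P) for P = 1,024. It treats the same numerical fraction as s in one formula and as s' in the other, which is an assumption made only for side-by-side comparison since the two fractions are measured on different systems; the function names are illustrative.

```python
def scaled_speedup(s_par, P):
    """Scaled speedup (2.5): s' + p'P, with s' + p' = 1 measured on the parallel system."""
    p_par = 1.0 - s_par
    return s_par + p_par * P


def fixed_size_speedup(s, P):
    """Fixed-size (Amdahl) speedup: 1 / (s + p/P), with s + p = 1 on a serial processor."""
    return 1.0 / (s + (1.0 - s) / P)


if __name__ == "__main__":
    P = 1024
    for frac in (0.01, 0.05, 0.10):
        print(frac,
              round(fixed_size_speedup(frac, P), 1),
              round(scaled_speedup(frac, P), 1))
```

The scaled figure falls off linearly in the serial fraction, with slope 1 - P, rather than collapsing toward the reciprocal of that fraction.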
H. SUMMARY 


This chapter considers the background necessary to develop the algorithms
(Chapters III and IV) and implement them (Chapter V). Algorithms are described
as sequential plans first (Chapter III). The Gauss factorization algorithm is given
in detail (Chapter III), including a discussion on the significance of pivoting. The
method of conjugate gradients receives less attention, but a brief introduction is
given near the end of Chapter III. The parallel considerations surveyed quickly in
this chapter receive more attention in Chapter IV.

Figure 2.4: Scaled Speedup

III. THEORY


No human investigation can be called real science if it cannot be demonstrated
mathematically.


— LEONARDO DA VINCI (1452-1519) 


A. SCOPE 


The goal of this research is to demonstrate a parallel method for solving a 
system of linear equations. The implementation targets two contemporary MIMD 
architectures: the Intel iPSC/2 and networks of INMOS transputers. There are many 
methods for solving linear systems. This work concentrates primarily upon Gauss 
factorization (GF), but the method of conjugate gradients (CG) is also introduced. 
Regrettably, CG is not developed due to time constraints (the derivation is not 
trivial). This does not imply that Gauss factorization is superior, nor that it possesses 
greater potential for parallel solution. Indeed, Hestenes and Stiefel preferred CG to 
GF for a number of very good reasons [Ref. 18: p. 409]. 

As we shall see, the utility of either method is quite dependent upon the nature 


of the particular problem. Consider the system of linear equations represented by 
Au = b. (3.1) 


Much of the subsequent discussion applies to general, rectangular systems where
$A \in \mathbb{R}^{m \times n}$. For the examples, however, square systems ($A \in \mathbb{R}^{n \times n}$) are used. This
restriction greatly simplifies the discussion without losing much of the concept as
it applies to general systems. The Gauss process, i.e., the main part of the work,
excluding the stopping criteria and interpretation of the result, is the same in all
three cases (m < n, m = n, and m > n).

To be sure, the three cases correspond to fundamentally different real-world
systems, but the algorithms for each case are almost identical. The extensions to
the general case are well known. Golub and Van Loan [Ref. 21: p. 102] give more
detail, but the square case is most expedient for now. Square systems also simplify
the experimental procedure, data collection, and analysis.

The Gauss method follows naturally from a hand method and it holds strong 
appeal to intuition. Without a pivoting strategy, however, Gauss can attempt division 
by zero. There is also a more subtle issue of rounding errors within the limits of 
finite-precision arithmetic. To forestall errors of both kinds, partial and complete 
pivoting strategies are used. This chapter develops the (sequential) algorithms and 
explains the concept of pivoting. This is a sensible starting point for Chapter IV, 


where parallel versions of the algorithms are given. 
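
Both hazards can be seen on 2 × 2 systems. The following is a minimal sketch of Gaussian elimination with optional partial pivoting, written for this illustration; it is not the algorithm listing developed later in this chapter nor the code of Appendix F, and the two test systems are chosen here for the purpose.

```python
def gauss_solve(A, b, pivot=True):
    """Solve Ax = b for square A by elimination followed by back substitution."""
    n = len(A)
    # Work on an augmented copy [A | b] so the caller's data is untouched.
    M = [[float(v) for v in row] + [float(bi)] for row, bi in zip(A, b)]
    for k in range(n):
        if pivot:
            # Partial pivoting: move the largest |entry| of column k onto the diagonal.
            r = max(range(k, n), key=lambda i: abs(M[i][k]))
            M[k], M[r] = M[r], M[k]
        for i in range(k + 1, n):
            m = M[i][k] / M[k][k]          # raises ZeroDivisionError on a zero pivot
            for j in range(k, n + 1):
                M[i][j] -= m * M[k][j]
    x = [0.0] * n
    for i in range(n - 1, -1, -1):
        s = sum(M[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (M[i][n] - s) / M[i][i]
    return x


if __name__ == "__main__":
    # Zero pivot: fails without a row exchange, succeeds with one.
    A, b = [[0.0, 1.0], [1.0, 1.0]], [1.0, 2.0]
    print(gauss_solve(A, b))
    try:
        gauss_solve(A, b, pivot=False)
    except ZeroDivisionError:
        print("zero pivot without row exchange")

    # Tiny pivot: no division by zero, but rounding destroys the first unknown.
    A, b = [[1e-20, 1.0], [1.0, 1.0]], [1.0, 2.0]
    print(gauss_solve(A, b, pivot=False))
    print(gauss_solve(A, b, pivot=True))
```

In the second system the exact solution is very close to (1, 1); without pivoting the huge multiplier swamps the second row and the computed first unknown is lost to rounding.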
B. APPROACH 


There are many methods that may be applied to determine the solution of a 
system of linear equations. The methods were designed for different reasons and 
with different problems in mind, so each exhibits a unique behavior. One method 
is often preferred over another for a given problem. Ultimately, the criterion is 
performance, both in reliability and speed. The approach described here and in the 
remaining chapters seeks to “maximize performance” while retaining a reasonable 
balance of both efficiency and quality. Speed and numerical accuracy tend to oppose 
one another so we are left to choose from several options. 

A hand method introduces each algorithm. The example is small and concrete.
Solving a small problem gives useful insights into the algorithms. Once the hand
method is established, it is expressed in an equivalent matrix notation. A high-level
sequential algorithm is built upon this foundation. This algorithm shows how a
machine, using a sequence of instructions, solves the problem. It also gives good
estimates for the problem’s time and storage complexities. The sequential-to-parallel
transition involves enough issues to warrant separate coverage. These considerations
appear in Chapter IV.

In the sections that follow, Gaussian elimination is presented first. It provides the
background (a first pass, in effect) for Gauss factorization. Once the reduction process
is understood, we proceed to factorization. A description of the method of conjugate
gradients is given at the end of the chapter. This method, due to Hestenes and
Stiefel, is based upon relatively deep theory, so the derivations and background
are not included. Nevertheless, a synopsis of the method is given.
C. APPLYING THE METHODS 


A particular method is often tailored to a specific type of system. The method 
of conjugate gradients, for instance, is usually used when the matrix of coefficients, 
A, is symmetric and positive definite [Ref. 18: p. 411]. The Gauss factorization
algorithm is equally important, but it takes quite another approach to solving this
system. Both CG and GF lie within the broad category of methods of conjugate
directions (Chapter II). Indeed, both work in just about any case, but better
results are obtained by using the tool that fits the task at hand.

A very rough characterization of the problem can simplify algorithm selection. 
We will look for two qualities: structure and density. CG, for instance, performs 
best when applied to highly structured, sparse matrices (i.e., matrices with many zero 
entries). Systems like the sparse, symmetric, highly structured result of Appendix D
deserve careful solutions that do not destroy the existing zeros. Zeros are not always
easy to come by. Gaussian elimination must expend $2n^3/3$ flops to create them.

Selecting the wrong algorithm can lead to slower execution. More importantly,
poor algorithm choice is a blunder (Chapter II). It can produce results that are
accidentally perfect, grossly incorrect, or anywhere between. Therefore, no less than
three tasks confront us:


• Characterize the problem. In systems like (3.1), attributes of the matrix of
  coefficients, A, may provide a wealth of information.

• Understand the algorithm(s). Know the types of problem(s) it is designed for
  (and, more importantly, know why).

• Create or select an algorithm that suits the problem.


The sparse, highly-structured problems are not rare! Anyone who has observed 
nature knows that many natural phenomena exhibit incredible structure and sim- 
plicity. Strategies for solving the corresponding system should always seek to exploit 
these characteristics. Both sparseness and structure can reduce storage requirements 
and the number of flops required. If we know the structure in advance, there may 
be a smart way to avoid some calculations entirely or minimize the work involved. 
(Recall Hestenes and Stiefel’s characterization of a “good” machine method from
Chapter II.) Other problems, when translated into the form (3.1), exhibit a dense
matrix, A, with little or no apparent structure. 

These two types of problems should not be handled with the same tools. As 
with many computational problems, the reasons involve the use of time and space. 
We shall see that the Gauss algorithm has time complexity $O(n^3)$ and storage
requirements $O(n^2)$. (Complexity notation appears in Appendix A.) Numbers like
these grow rapidly with n and, regardless of how much memory is available, the
problem can quickly overpower the computer. A naive approach to problems of
these kinds can be expensive in terms of both storage and time. This is usually
adequate incentive to take advantage of sparseness and structure whenever possible.
When it is not possible, Gauss is a good choice.
D. GAUSSIAN ELIMINATION 


Suppose that we want to solve a system of linear equations using a systematic, 
step-by-step method. We assume that the system of linear equations is given, and 
that the method must preserve the original properties of the system. That is, the
method must be restricted to certain operations; namely:

• Multiply an equation by a nonzero constant.

• Interchange equations.

• Add a multiple of one equation to another.

The fact that the first two operations do not change the system’s properties is
evident. The third operation is also legitimate, though perhaps less obviously so, and
it is computationally the most significant. Now let us apply some of these operations to
a system of four equations in the four unknowns, $v_1$, $v_2$, $v_3$, and $v_4$.


$$
\begin{aligned}
2v_1 + 3v_2 + 4v_3 + 5v_4 &= 0 \\
4v_1 + 6v_2 + 8v_3 + 5v_4 &= -5 \\
2v_1 + 4v_2 + 7v_3 + 9v_4 &= 13 \\
6v_1 + 8v_2 + 8v_3 + 9v_4 &= -17
\end{aligned}
\tag{3.2}
$$


Let m (= 4) be the number of equations, and let n (= 4) be the number of unknowns
in each equation. Additionally, let i be an equation (or row) index ($1 \le i \le m$) and
let j indicate a subscript of v (column index) so that $1 \le j \le n$. Finally, let $a_{ij}$ be the
coefficient of $v_j$ in equation i (e.g., $a_{12} = 3$). Suppose that the last equation contains
only one nonzero coefficient (say $a_{44}$), the third equation has only two nonzero
coefficients ($a_{33}$ and $a_{34}$), and so on. This defines a triangular system (Appendix A).
The triangular system is our goal because it is easier to solve (by back substitution)
than the current (square, dense) system.

Next, observe that a triangular system would result if we could eliminate every
coefficient, $a_{i1}$, of $v_1$ in all equations but the first ($i > 1$), the coefficients, $a_{i2}$, of $v_2$ in
the last two equations ($i > 2$), and the coefficient, $a_{43}$, of $v_3$ in the final equation. To
do this, we work by stages. At stage k, the coefficient, $a_{kk}$, of $v_k$ in the $k$th equation
is called the pivot. This term has little significance now but is clarified later (and
it plays a very important role in the examples presented). In a particular stage, k,
the goal is to operate upon all equations i where $i \in \{(k+1), (k+2), \ldots, m\}$ and
eliminate all coefficients, $a_{ik}$, of $v_k$.
1. A Hand Method 


Before attempting to describe an algorithm for a machine solution, we consider
an application of Gaussian elimination (GE) by hand. Initially, let k = 1. In
the example system (3.2), the first (k = 1) pivot is the coefficient, $a_{11} = 2$, of $v_1$
in the first equation. Notice that by subtracting twice the first equation from the
second, a zero is produced under the pivot (eliminating $a_{21}$). Similarly, by subtracting
the first equation from the third, a zero appears as the leading coefficient in the
third equation (eliminating $a_{31}$). Finally, three times the first equation subtracted
from the fourth equation eliminates the coefficient $a_{41}$. Following these steps the
altered system is:


$$
\begin{aligned}
2v_1 + 3v_2 + 4v_3 + 5v_4 &= 0 \\
-5v_4 &= -5 \\
v_2 + 3v_3 + 4v_4 &= 13 \\
-v_2 - 4v_3 - 6v_4 &= -17
\end{aligned}
\tag{3.3}
$$


This is called the natural reduction process [Ref. 22: p. 72]. In this particular case,
there are no changes on the right-hand side because the first equation’s right-hand
side is zero. This makes for trivial arithmetic on the right-hand side, but we should
remember to perform the arithmetic upon whole equations (including the right-hand
side) in general. The elimination is even more successful than planned.
The second equation already has zeros where we ultimately wanted them
in the fourth equation. That is, the system (3.3) would be closer to upper triangular
if we were to alter it by interchanging equations 2 and 4.


$$
\begin{aligned}
2v_1 + 3v_2 + 4v_3 + 5v_4 &= 0 \\
-v_2 - 4v_3 - 6v_4 &= -17 \\
v_2 + 3v_3 + 4v_4 &= 13 \\
-5v_4 &= -5
\end{aligned}
\tag{3.4}
$$


The system (3.4) is called a row permutation of (3.3). The ability to recognize 
patterns is a great advantage that human problem solvers enjoy. Therefore, taking 
advantage of our capabilities we use a rather subjective “human” pivoting strategy. 
But it is not fitting to assume that an efficient algorithm for a machine would involve 
the same sort of pattern recognition. 

The system (3.4) is nearly triangular. The pivot moves to the second equation
(k = 2), and we focus on the coefficient, $a_{22} = -1$, of $v_k = v_2$. By adding
the second equation to the third, the only nonzero coefficient remaining in the lower
triangle ($a_{32}$) is eliminated. The resulting system becomes


$$
\begin{aligned}
2v_1 + 3v_2 + 4v_3 + 5v_4 &= 0 \\
-v_2 - 4v_3 - 6v_4 &= -17 \\
-v_3 - 2v_4 &= -4 \\
-5v_4 &= -5
\end{aligned}
\tag{3.5}
$$


The system is triangular, and it is easy to solve for the unknown values, $v_i$, by back
substitution. By inspection, $v_4 = 1$. Substituting this value into the third equation,
we find that $v_3 = 2$. Substituting both values ($v_4$ and $v_3$) into the second equation
yields $v_2 = 3$. Finally, substituting the values $v_4$, $v_3$, and $v_2$ into the first equation
gives $v_1 = -11$. The solution to the system is then

$$
u = \begin{bmatrix} v_1 \\ v_2 \\ v_3 \\ v_4 \end{bmatrix}
  = \begin{bmatrix} -11 \\ 3 \\ 2 \\ 1 \end{bmatrix}.
\tag{3.6}
$$
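
The back substitution just carried out by hand on the triangular system (3.5) can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function name `back_substitute`, the list-of-lists storage, and the use of exact rational arithmetic are choices made for this example and are not taken from the thesis codes.

```python
from fractions import Fraction

def back_substitute(U, c):
    """Solve U v = c for an upper triangular U with nonzero diagonal entries."""
    n = len(U)
    v = [Fraction(0)] * n
    for i in range(n - 1, -1, -1):            # last equation first
        known = sum(U[i][j] * v[j] for j in range(i + 1, n))
        v[i] = (Fraction(c[i]) - known) / U[i][i]
    return v

# Coefficients and right-hand side of the triangular system (3.5)
U = [[2, 3, 4, 5],
     [0, -1, -4, -6],
     [0, 0, -1, -2],
     [0, 0, 0, -5]]
c = [0, -17, -4, -5]
solution = back_substitute(U, c)
```

Each unknown is found from the equations below it, in the same order used in the hand computation ($v_4$, then $v_3$, $v_2$, and $v_1$).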
2. A Machine Method


The foregoing example illustrated the GE process as done on paper. The 
system was intentionally created for easy solution by hand calculation; that is, it uses
integers and elimination occurs faster than in the usual case. Even this simple example
requires a few minutes to determine u from the system (3.2) by hand. In Chapter 
VI, we see that a machine can perform this task in (much) less than a second. For 
this reason, it is worth examining an equivalent process to solve for such a system 
by machine. 

We reenact the solution from the beginning, this time in a fashion that 
a sequence-controlled machine could perform. Until now, we have used the term 
“pivot” but have found no practical use for pivots. In this example, we begin to 
realize the utility of a pivoting strategy. We start with “no pivoting” and shift to 
the “partial pivoting” strategy. Additionally, we begin to use a more compact matrix 
notation. Appendix A describes the notation followed. 

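By the method described in Appendix A, we give the linear system (3.2) a
matrix representation that corresponds to (3.1):

$$
Au = \begin{bmatrix} 2 & 3 & 4 & 5 \\ 4 & 6 & 8 & 5 \\ 2 & 4 & 7 & 9 \\ 6 & 8 & 8 & 9 \end{bmatrix}
\begin{bmatrix} v_1 \\ v_2 \\ v_3 \\ v_4 \end{bmatrix}
= \begin{bmatrix} 0 \\ -5 \\ 13 \\ -17 \end{bmatrix}
= \begin{bmatrix} \beta_1 \\ \beta_2 \\ \beta_3 \\ \beta_4 \end{bmatrix} = b.
\tag{3.7}
$$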


First, we initialize a stage counter, k, so that k = 1. The pivot in stage k is $a_{kk}$, on
the diagonal of A ($a_{11} = 2$). The immediate goal is to produce zeros beneath the
pivot, in A(2:4,1). A three-step process eliminates these coefficients in row order:

• Divide. Divide every element beneath the pivot by the pivot value.

• Update. Perform arithmetic in the Gauss transform area.

• Eliminate. Set the elements beneath the pivot equal to zero.

The first step is a division. The denominator (pivot) is $a_{kk} = a_{11} = 2$, so
$a_{21}$ becomes the multiplier $(a_{21}/2) = 2$. Similarly, let $a_{31} = 1$ and let $a_{41} = 3$. Now

$$
A = \begin{bmatrix} 2 & 3 & 4 & 5 \\ 2 & 6 & 8 & 5 \\ 1 & 4 & 7 & 9 \\ 3 & 8 & 8 & 9 \end{bmatrix}.
\tag{3.8}
$$


Next, consider everything below and to the right of the pivot. This is the Gauss
transform area, G = A((k+1):m, (k+1):n) = A(2:4,2:4). For each element in
G, replace the current value, $a_{ij}$, with $a_{ij} - (a_{ik})(a_{kj})$. Do the same thing in the
corresponding rows ($i > k$) of b, replacing $\beta_i$ with $\beta_i - (a_{ik})(\beta_k)$. We will call this
the process of performing arithmetic in (or updating) the Gauss transform area, G.

Finally, when the values beneath the pivot are no longer needed, eliminate
them (set them equal to zero). The result is equivalent to the system (3.3):

$$
\begin{bmatrix} 2 & 3 & 4 & 5 \\ 0 & 0 & 0 & -5 \\ 0 & 1 & 3 & 4 \\ 0 & -1 & -4 & -6 \end{bmatrix}
\begin{bmatrix} v_1 \\ v_2 \\ v_3 \\ v_4 \end{bmatrix}
= \begin{bmatrix} 0 \\ -5 \\ 13 \\ -17 \end{bmatrix}.
\tag{3.9}
$$


We have finished one stage of GE. We move into the next stage, k = 2. This time,
when we try to update G we run into a very serious problem. The first step is to
divide everything underneath the pivot by the pivot value $a_{kk} = a_{22} = 0$. This is
the divide-by-zero problem of a “no pivoting” strategy.
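
The divide, update, and eliminate steps of one stage can be sketched as follows. This is a minimal illustration that assumes dense list-of-lists storage and exact rational arithmetic; the name `ge_stage` is hypothetical and does not come from the thesis codes. Running stage one on the example reproduces the system (3.9), and attempting stage two without pivoting meets the zero pivot described above.

```python
from fractions import Fraction

def ge_stage(A, b, k):
    """Apply stage k (1-based) of Gaussian elimination, without pivoting, in place."""
    m, n = len(A), len(A[0])
    p = k - 1                              # 0-based position of the pivot
    pivot = A[p][p]
    if pivot == 0:
        raise ZeroDivisionError(f"zero pivot at stage {k}")
    for i in range(p + 1, m):
        A[i][p] /= pivot                   # Divide: the entry becomes the multiplier
    for i in range(p + 1, m):
        for j in range(p + 1, n):
            A[i][j] -= A[i][p] * A[p][j]   # Update the Gauss transform area G
        b[i] -= A[i][p] * b[p]             # same arithmetic on the right-hand side
    for i in range(p + 1, m):
        A[i][p] = Fraction(0)              # Eliminate: discard the multipliers

A = [[Fraction(x) for x in row]
     for row in [[2, 3, 4, 5], [4, 6, 8, 5], [2, 4, 7, 9], [6, 8, 8, 9]]]
b = [Fraction(x) for x in [0, -5, 13, -17]]

ge_stage(A, b, 1)          # A and b now hold the system (3.9)
try:
    ge_stage(A, b, 2)      # the second pivot is zero
    zero_pivot_found = False
except ZeroDivisionError:
    zero_pivot_found = True
```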

During the execution of the hand example we simply moved the row to the
bottom of the system to avoid this problem. Now, we could instruct the machine
to test every element in A(k:m,k:n) and interchange rows so that those with
the most leading zeros were placed at the bottom. This is problematic for several
reasons. First, it is not dependable (testing floating-point numbers for equality
invites disaster). Second, even if we could identify zeros with confidence, it would
add a sorting problem to GE! We are not looking for extra work. The solution is
partial pivoting.

3. Partial Pivoting


Partial pivoting is an application of row interchanges to eliminate (primarily)
the divide-by-zero problem. Consider the system of equations (3.1) with the
nonsingular matrix of coefficients, $A \in \mathbb{R}^{n \times n}$ (i.e., m = n and the system has exactly
one solution). Suppose further that storage and arithmetic are performed in infinite
precision. (These assumptions, infinite precision and A nonsingular, are essential.)

Even in this ideal situation Gauss without pivoting is dangerous because,
as we have just seen, it may attempt to divide by zero. Proper row permutations
completely eliminate this problem. Partial pivoting guarantees the existence
of n nonzero pivots for A nonsingular. In fact, if we encounter a zero pivot with
partial pivoting, it means that A is singular [Ref. 23]. The remainder of this section
describes the partial pivoting strategy.

Consider stage k of the GE process with $A \in \mathbb{R}^{n \times n}$. The goal is to pick
the “best” row remaining (i.e., at or below the current pivot) and install it as row
k, the pivot row. For reasons that are explained later, “best” shall mean the row
whose $k$th (pivot column) element is largest in absolute value. Let s be the row index
for the best pivot candidate. Initially, let s = k (i.e., $a_{kk}$ is the first candidate). Next, we move
down the pivot column, considering all $a_{ik}$ where $i > k$.

To eliminate unnecessary assignments, we replace the current candidate
with another only if $|a_{ik}| > |a_{sk}|$. When this occurs, we make sure that s is updated
by setting it equal to i. After considering all elements, $a_{ik}$, for $k < i \le m$, s is the
index of the “best possible” pivot row. To accomplish our goal, we must perform a row
interchange. This is easy after the new pivot row has been determined. We simply
swap rows k and s (if $k \ne s$). Within the assumptions above, we have completely
eliminated the potential for division by zero. Now let us return to the problem at
hand.
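
The pivot search described in this section can be sketched as below. The helper names `find_partial_pivot` and `swap_rows` are assumptions made for illustration; indices are kept 1-based to match the text. The sample matrix is the coefficient matrix reached after stage one of the machine method.

```python
def find_partial_pivot(A, k):
    """Return the 1-based row index s of the pivot row for stage k (1-based)."""
    s = k                                        # a_kk is the first candidate
    for i in range(k + 1, len(A) + 1):           # rows below the first candidate
        if abs(A[i - 1][k - 1]) > abs(A[s - 1][k - 1]):
            s = i                                # strictly larger: new candidate
    return s

def swap_rows(A, k, s):
    """Interchange rows k and s (1-based); nothing happens when k == s."""
    if k != s:
        A[k - 1], A[s - 1] = A[s - 1], A[k - 1]

A = [[2, 3, 4, 5],
     [0, 0, 0, -5],
     [0, 1, 3, 4],
     [0, -1, -4, -6]]
s = find_partial_pivot(A, 2)
swap_rows(A, 2, s)
```

Because the comparison is strict, a later entry of equal magnitude does not displace the current candidate, which is why row three rather than row four is chosen here.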
4. A Machine Method (Resumed)


Applying partial pivoting to the system (3.9), we find that the next pivot
is located at A(3,2) so we must interchange rows (equations) two and three. Before
performing this step, however, let us create a vector to keep track of the row
permutations. Let $q \in \mathbb{R}^m$ be the row permutation vector. We initialize q so that

$$
q = \begin{bmatrix} \psi_1 \\ \psi_2 \\ \psi_3 \\ \psi_4 \end{bmatrix}
  = \begin{bmatrix} 1 \\ 2 \\ 3 \\ 4 \end{bmatrix}
\tag{3.10}
$$

and perform row interchanges in q corresponding to those in A so that $\psi_i$ is always
the original equation number for current equation number i. Thus, after performing
the row interchange, we have


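$$
\begin{bmatrix} 2 & 3 & 4 & 5 \\ 0 & 1 & 3 & 4 \\ 0 & 0 & 0 & -5 \\ 0 & -1 & -4 & -6 \end{bmatrix}
\begin{bmatrix} v_1 \\ v_2 \\ v_3 \\ v_4 \end{bmatrix}
= \begin{bmatrix} 0 \\ 13 \\ -5 \\ -17 \end{bmatrix},
\qquad
q = \begin{bmatrix} 1 \\ 3 \\ 2 \\ 4 \end{bmatrix}.
\tag{3.11}
$$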


Notice that $\psi_3 = 2$ indicates that the third equation in (3.11) was the second equation
in the original system (3.7). Now, since $a_{32} = 0$, no arithmetic is required in the
third row. In row four, the arithmetic is equivalent to adding (the
current) equation two to equation four. The result is


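$$
\begin{bmatrix} 2 & 3 & 4 & 5 \\ 0 & 1 & 3 & 4 \\ 0 & 0 & 0 & -5 \\ 0 & 0 & -1 & -2 \end{bmatrix}
\begin{bmatrix} v_1 \\ v_2 \\ v_3 \\ v_4 \end{bmatrix}
= \begin{bmatrix} 0 \\ 13 \\ -5 \\ -4 \end{bmatrix}.
\tag{3.12}
$$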


When we move the pivot index to the third equation (k = 3), we notice that $a_{33} = 0$.
The divide-by-zero problem has resurfaced. Once again, we pivot, swapping rows
three and four. After this, we have

$$
\begin{bmatrix} 2 & 3 & 4 & 5 \\ 0 & 1 & 3 & 4 \\ 0 & 0 & -1 & -2 \\ 0 & 0 & 0 & -5 \end{bmatrix}
\begin{bmatrix} v_1 \\ v_2 \\ v_3 \\ v_4 \end{bmatrix}
= \begin{bmatrix} 0 \\ 13 \\ -4 \\ -5 \end{bmatrix},
\qquad
q = \begin{bmatrix} 1 \\ 3 \\ 4 \\ 2 \end{bmatrix}.
\tag{3.13}
$$

The zero beneath the final pivot obviates the need for further arithmetic. The triangular
system (3.13), found by our machine method, does not look like the system (3.5)
from the hand method because we did not perform the same row interchanges. If we
had maintained a row permutation vector, q, for the hand method we would have
noticed that

$$
q = \begin{bmatrix} 1 \\ 4 \\ 3 \\ 2 \end{bmatrix}.
\tag{3.14}
$$

Of course, back substitution for the final (triangular) machine system (3.13) yields
the same solution

$$
u = \begin{bmatrix} v_1 \\ v_2 \\ v_3 \\ v_4 \end{bmatrix}
  = \begin{bmatrix} -11 \\ 3 \\ 2 \\ 1 \end{bmatrix}
\tag{3.15}
$$

as that of the hand method. Thus, even though we used different permutation
schemes, the “pivots” in both cases were always nonzero and the solutions were the
same. This is not surprising, since A is nonsingular and row permutation is merely
the practice of interchanging equations.
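
For reference, the whole procedure (pivot search, row interchange, elimination, and back substitution) can be collected into one short routine. This is a sketch under stated assumptions, not the thesis implementation: the name `solve_with_partial_pivoting` is hypothetical, and exact rational arithmetic is used so the result can be compared with the hand calculation. Unlike the walkthrough above, which began with no pivoting, this sketch pivots at every stage, so its intermediate systems and its permutation vector differ from (3.13), while the solution is the same.

```python
from fractions import Fraction

def solve_with_partial_pivoting(A, b):
    """Solve A u = b by Gaussian elimination with partial pivoting; returns (u, q)."""
    n = len(A)
    U = [[Fraction(x) for x in row] for row in A]   # work on copies
    c = [Fraction(x) for x in b]
    q = list(range(1, n + 1))                       # row permutation vector
    for k in range(n):
        # first row at or below k whose pivot-column entry is largest in magnitude
        s = max(range(k, n), key=lambda i: abs(U[i][k]))
        if U[s][k] == 0:
            raise ValueError("zero pivot with partial pivoting: A is singular")
        if s != k:
            U[k], U[s] = U[s], U[k]
            c[k], c[s] = c[s], c[k]
            q[k], q[s] = q[s], q[k]
        for i in range(k + 1, n):
            mult = U[i][k] / U[k][k]
            for j in range(k, n):
                U[i][j] -= mult * U[k][j]
            c[i] -= mult * c[k]
    u = [Fraction(0)] * n
    for i in range(n - 1, -1, -1):                  # back substitution
        known = sum(U[i][j] * u[j] for j in range(i + 1, n))
        u[i] = (c[i] - known) / U[i][i]
    return u, q

A = [[2, 3, 4, 5], [4, 6, 8, 5], [2, 4, 7, 9], [6, 8, 8, 9]]
b = [0, -5, 13, -17]
u, q = solve_with_partial_pivoting(A, b)
```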

Let us review first the process and then the theory of Gaussian elimination.
The GE process performs a systematic elimination of the lower (in our example)
triangle of a matrix of coefficients, A. Arithmetic operations are performed upon
entire equations at the same time (including the right-hand side, b). In other words,
during stage k of the process, arithmetic operations are performed upon (portions of)
all rows i ($i > k$) of A and upon all elements (rows) $\beta_i$ (for $i > k$) of the right-hand
side, b. The process depends upon both A and b and both of them can be changed
substantially.

The idea behind Gaussian elimination is that general square systems are
difficult to solve, but triangular systems are easy. The goal is to transform a general
matrix A into triangular form, performing legitimate arithmetic upon entire equations
(including the right-hand sides). Reduction to triangular form costs $2n^3/3$
flops. Once A is reduced to triangular form, back substitution yields a solution for
the unknown, u, in $n^2$ flops. Thus GE solves a general, dense, square system of n
equations in n unknowns by the application of $2n^3/3 + n^2$ flops [Ref. 21: pp. 88, 97].
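
To give a feel for how quickly this count grows, the short sketch below simply evaluates the expression quoted above for a few problem sizes. The function name `ge_flops` is hypothetical, and the count is the leading-order estimate from the text, not a measurement.

```python
from fractions import Fraction

def ge_flops(n):
    """Estimate from the text: 2n^3/3 flops for reduction plus n^2 for back substitution."""
    return Fraction(2 * n**3, 3) + n**2

for n in (4, 100, 1000):
    print(n, float(ge_flops(n)))
```

Since the cubic term dominates, doubling n multiplies the estimated work by roughly eight.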
E. GAUSS FACTORIZATION 


Gauss factorization (GF) is a well-known method for solving linear systems
like (3.1) that (simultaneously) factors A. GF has strong ties to the GE process.
Those ties will become evident as we develop the same example over again, this time
using the GF bookkeeping and method. GF holds several major advantages over GE.
Among these: A is recoverable (the process does not destroy it) and the process is
independent of the right-hand side, b. In fact, b is not used in the factoring process.
1. Complete Pivoting 


The complete pivoting strategy will be applied in this example. There is no
special significance behind the introduction of complete pivoting with the GF process.
Either strategy can be used with GE or GF. (A “no pivoting” strategy is also
available, but it is not generally acceptable for serious problems.) The complete
strategy is a straightforward extension of the partial strategy, so introducing partial
pivoting first was practical.

With complete pivoting, row interchanges are still allowed, but so are column
interchanges. We will continue to use $q \in \mathbb{R}^m$ for row interchange bookkeeping.
The vector $p \in \mathbb{R}^n$, similarly, will maintain the column permutation information. We
search not just the pivot column, but the entire Gauss transform area, for the next
pivot. This takes longer but generally produces better solutions. The numerical differences
between partial and complete pivoting involve some difficult error analysis.
These issues will be addressed briefly after we complete the examples.

2. Example


Now the GF process is demonstrated. We start with the same system of
four equations in four unknowns:

$$
\begin{aligned}
2v_1 + 3v_2 + 4v_3 + 5v_4 &= 0 \\
4v_1 + 6v_2 + 8v_3 + 5v_4 &= -5 \\
2v_1 + 4v_2 + 7v_3 + 9v_4 &= 13 \\
6v_1 + 8v_2 + 8v_3 + 9v_4 &= -17
\end{aligned}
\tag{3.16}
$$

and proceed immediately to the matrix of coefficients (the factoring part of GF
concerns itself with A only).

$$
A = \begin{bmatrix} 2 & 3 & 4 & 5 \\ 4 & 6 & 8 & 5 \\ 2 & 4 & 7 & 9 \\ 6 & 8 & 8 & 9 \end{bmatrix}
\tag{3.17}
$$


a. Stage Zero 


For the initial stage, k = 0, let the Gauss transform area be G = A.
Also initialize pivot indices s = t = 1. The sole purpose of stage zero is to find the
first pivot. Initially, we guess that the pivot is $a_{11}$, located at A(1,1), the upper
left-hand corner of G. (This is the position where the new pivot will be installed.)
Accordingly, we set row and column indices, s = 1 and t = 1, to keep track of the
best pivot candidate.

Indices s and ¢ are changed only when we find a superior candidate for 
the pivot. To begin the column-by-column search for the pivot we move down the 
columns in order from left to right and through each column in a top-to-bottom 
manner. When we have considered every element in G, we know that the next pivot 
is currently situated at A(s,t). 

For the current example, as we move down the first column of G, the
values of s and t are adjusted twice. A better pivot candidate is found, first at A(2,1),
and next at A(4,1). The indices are adjusted again in the last row of column two,
where the value, 8, is larger than the value of the current candidate, 6. Column
three has no candidates larger than 8, so we do not adjust the indices again until we
find the 9 at A(3,4). Thus s = 3 and t = 4 have located the next pivot according
to a complete pivoting strategy. This accomplishes the goal of stage zero. Now we
specify the process for each of the remaining stages.
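
The stage-zero search just described can be sketched as follows. The function name `find_complete_pivot`, its `top` parameter (the 1-based row and column where the search area begins), and the returned `trace` of accepted candidates are assumptions introduced for this illustration.

```python
def find_complete_pivot(A, top):
    """Search A(top:m, top:n) column by column, top to bottom (1-based indices).

    Returns (s, t, trace), where trace lists each position at which a
    strictly larger candidate replaced the current one.
    """
    m, n = len(A), len(A[0])
    s, t = top, top                        # first guess: upper left corner of G
    trace = []
    for j in range(top, n + 1):
        for i in range(top, m + 1):
            if abs(A[i - 1][j - 1]) > abs(A[s - 1][t - 1]):
                s, t = i, j
                trace.append((s, t))
    return s, t, trace

A = [[2, 3, 4, 5],
     [4, 6, 8, 5],
     [2, 4, 7, 9],
     [6, 8, 8, 9]]
s, t, trace = find_complete_pivot(A, 1)
```

The trace records the same sequence of adjustments as the text: two in the first column, one at the bottom of the second, and the last at A(3,4).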


b. Outline of the GF Process 


For each stage, k, of GF, we shall perform the following steps:

• Locate the pivot according to a pivoting strategy (none, partial, or complete).
  If complete pivoting is used, search all of G for the next pivot.

• Increment the pivot index, k.

• Perform any row and/or column permutations that are required to move the
  pivot into the position A(k,k). Update p and q accordingly.

• Divide every element beneath the pivot by the pivot value.

• Redefine the Gauss transform area so that G = A((k+1):m, (k+1):n).

• Perform the appropriate arithmetic in G.
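
The steps above can be sketched as a single routine. This is an illustrative sketch only: the name `gauss_factor`, the 0-based loop indices, and the exact rational arithmetic are assumptions made here, and the routine is not the code used for the experiments in this thesis. It works on a copy of A, keeps the multipliers in place of the eliminated entries, and returns the 1-based permutation vectors p and q.

```python
from fractions import Fraction

def gauss_factor(A):
    """Gauss factorization with complete pivoting; returns (F, p, q)."""
    m, n = len(A), len(A[0])
    F = [[Fraction(x) for x in row] for row in A]
    p = list(range(1, n + 1))              # column permutation vector
    q = list(range(1, m + 1))              # row permutation vector
    for k in range(min(m, n)):
        s, t = k, k                        # locate the pivot in G = F[k:, k:]
        for j in range(k, n):
            for i in range(k, m):
                if abs(F[i][j]) > abs(F[s][t]):
                    s, t = i, j
        if F[s][t] == 0:
            break                          # nothing but zeros remains
        F[k], F[s] = F[s], F[k]            # row interchange
        q[k], q[s] = q[s], q[k]
        for row in F:                      # column interchange
            row[k], row[t] = row[t], row[k]
        p[k], p[t] = p[t], p[k]
        for i in range(k + 1, m):
            F[i][k] /= F[k][k]             # divide beneath the pivot
            for j in range(k + 1, n):
                F[i][j] -= F[i][k] * F[k][j]   # arithmetic in G
    return F, p, q

A = [[2, 3, 4, 5], [4, 6, 8, 5], [2, 4, 7, 9], [6, 8, 8, 9]]
F, p, q = gauss_factor(A)
```

Applied to the example matrix, the sketch visits the same pivots (9, 37/9, and 122/37) as the stage-by-stage account that follows.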


Let us return to the example and exercise the process. 


c. Stage One 


Since stage zero has already located the first pivot, the first step of
section b is not necessary in this stage. We increment k (to k = 1) and install the
pivot A(3,4) at A(k,k) = A(1,1). This means that rows 1 and 3 must be swapped.
Columns 1 and 4 must be swapped in addition. The permutation vectors, p and q,
record the interchanges.
After interchanging rows and columns, we have

$$
A = \begin{bmatrix} 9 & 4 & 7 & 2 \\ 5 & 6 & 8 & 4 \\ 5 & 3 & 4 & 2 \\ 9 & 8 & 8 & 6 \end{bmatrix},
\qquad
p = \begin{bmatrix} 4 \\ 2 \\ 3 \\ 1 \end{bmatrix},
\qquad
q = \begin{bmatrix} 3 \\ 2 \\ 1 \\ 4 \end{bmatrix}.
\tag{3.18}
$$


Now we perform the division beneath the pivot, producing the multipliers in the
lower three rows in the leftmost column of A. When this is done, we perform the
arithmetic in G = A((k+1):m, (k+1):n) = A(2:4,2:4). For GF, we do not
replace the multipliers with zeros. We shall find that the multipliers are very useful
in the end. The result is

$$
A = \begin{bmatrix} 9 & 4 & 7 & 2 \\ 5/9 & 34/9 & 37/9 & 26/9 \\ 5/9 & 7/9 & 1/9 & 8/9 \\ 1 & 4 & 1 & 4 \end{bmatrix}.
\tag{3.19}
$$

Next, with G being the lower right (3 × 3) block of A, we search G for the next pivot
and find that A(s,t) = A(2,3) holds (37/9), the largest second pivot candidate.


d. Stage Two 


We increment the stage counter (k = 2), so that it points to the new
pivot location, A(2,2). Since s = k, we know that no row interchange is necessary
and q will not change. We must, however, swap columns k = 2 and t = 3. The result
is:

$$
A = \begin{bmatrix} 9 & 7 & 4 & 2 \\ 5/9 & 37/9 & 34/9 & 26/9 \\ 5/9 & 1/9 & 7/9 & 8/9 \\ 1 & 1 & 4 & 4 \end{bmatrix},
\qquad
p = \begin{bmatrix} 4 \\ 3 \\ 2 \\ 1 \end{bmatrix},
\qquad
q = \begin{bmatrix} 3 \\ 2 \\ 1 \\ 4 \end{bmatrix}.
\tag{3.20}
$$

Once again, we divide everything under the pivot by the value of the pivot and
update G. This yields

$$
A = \begin{bmatrix} 9 & 7 & 4 & 2 \\ 5/9 & 37/9 & 34/9 & 26/9 \\ 5/9 & 1/37 & 25/37 & 30/37 \\ 1 & 9/37 & 114/37 & 122/37 \end{bmatrix}.
\tag{3.21}
$$


e. Stage Three


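Now G becomes the (2 × 2) lower right block of A and the next pivot
(122/37) is found at A(s,t) = A(4,4). Since k = 3 we must interchange rows 3 and
4 as well as columns 3 and 4. The result of the permutation is

$$
A = \begin{bmatrix} 9 & 7 & 2 & 4 \\ 5/9 & 37/9 & 26/9 & 34/9 \\ 1 & 9/37 & 122/37 & 114/37 \\ 5/9 & 1/37 & 30/37 & 25/37 \end{bmatrix},
\qquad
p = \begin{bmatrix} 4 \\ 3 \\ 1 \\ 2 \end{bmatrix},
\qquad
q = \begin{bmatrix} 3 \\ 2 \\ 4 \\ 1 \end{bmatrix}.
\tag{3.22}
$$

Then, dividing at the bottom of the pivot column and updating G, we have

$$
A = \begin{bmatrix} 9 & 7 & 2 & 4 \\ 5/9 & 37/9 & 26/9 & 34/9 \\ 1 & 9/37 & 122/37 & 114/37 \\ 5/9 & 1/37 & 15/61 & -15/183 \end{bmatrix}.
\tag{3.23}
$$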


f. Stage Four 


The final stage, where k = 4 = min(m,n), is always trivial. We need
only to verify that $a_{44}$ is nonzero. This tells us that, indeed, A is nonsingular. There
is no arithmetic to perform, so (3.23) is the final, factored, copy of A.


g. Summary 


Using the Gauss factorization process we have systematically transformed
the matrix $A \in \mathbb{R}^{4 \times 4}$ into a form that factors the original version of A. At
this point the factorization itself has not been discussed, only the process whereby
we claim to have factored A. Before we explore the resulting factorization, let us
consider, in a general way, what happens in any stage, k, of GF.
3. One Stage of Gauss Factorization 


The most important part of GF is the factorization that it produces.
The GF process is reversible (pivots and other key information become part of the
factorization). This section, using block matrix notation and induction on the stage
number, illustrates the effect of one stage of GF. The proof shows that we can
perform an n-step Gauss factorization A = LR, with L unit lower triangular and R
right (upper) triangular with nonzero diagonal elements. Before the proof, however,
let us consider a concrete illustration where n = 15.

Let ® denote those elements that Gauss has fixed in both value and position.
The × symbol marks elements that are subject to permutations but not changes in
value. Those elements that are subject to both permutation and changes in value
are indicated by the © symbol. Elements in the pivot row are marked with the ©
symbol and the symbol @ denotes elements beneath the pivot. White space indicates
zeros, $\alpha$ is the pivot, and any $\rho_i$ was a former pivot (in stage i). Let k = 7. Then
the leftmost 7 columns of $R_7$ are already fixed in upper triangular form and $L_7$ is
unit lower triangular with the special form described above. Upon entering stage
(k+1) = 8 of the Gauss factorization process, the matrices $L_7$ and $R_7$ would appear
as shown below:


[Equation (3.24): symbol diagram of the 15 × 15 matrix $L_7$, unit lower triangular with the 8 × 8 identity as its lower right-hand block.]

[Equation (3.25): symbol diagram of the 15 × 15 matrix $R_7$, upper triangular in its leftmost 7 columns with the former pivots $\rho_1, \ldots, \rho_7$ on the diagonal.]


With this illustration in mind, let us prove the effect of GF. 


Proposition: Given $A \in \mathbb{R}^{n \times n}$. Let $L_i \in \mathbb{R}^{n \times n}$ be the unit lower triangular matrix
with $I_{n-i}$, the $(n-i) \times (n-i)$ identity, as its lower, right-hand block. Let $R_i \in \mathbb{R}^{n \times n}$
be the matrix that is upper (right) triangular in its leftmost i columns. Initially, let
$A = L_0 R_0$ with $L_0 = I$ and $R_0 = A$. Let P(k) be the proposition: “Stage k of the
Gauss factorization process yields the factorization, $A = L_k R_k$.”


To Show: $P(k) \Rightarrow P(k+1)$ for $0 \le k \le (n-1)$.


Assumptions: Pivoting, according to any valid strategy, is performed outside of
this factorization procedure and the pivoting strategy yields pivots, $\alpha \ne 0$.

Notation: We can partition A so that

$$
A = \begin{bmatrix} \alpha & y^T \\ x & G \end{bmatrix}
\tag{3.26}
$$

where $\alpha \in \mathbb{R}$ is the initial pivot, $x \in \mathbb{R}^{n-1}$ holds the values beneath the pivot,
$y \in \mathbb{R}^{n-1}$ holds the values of the elements in the pivot row to the right of the pivot,
and $G \in \mathbb{R}^{(n-1) \times (n-1)}$ is the Gauss transform area.


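Basis for Induction: We must show that $P(0) \Rightarrow P(1)$. P(0) means that $L_0 = I$
and $R_0 = A$. That is, $R_0$ has no special structure except (by assumption) we are
guaranteed a nonzero pivot $\alpha$. Consider stage k = 1 of Gauss factorization. Let us
partition A as above and factor

$$
A = \begin{bmatrix} \alpha & y^T \\ x & G \end{bmatrix}
  = \begin{bmatrix} 1 & 0^T \\ \ell & I \end{bmatrix}
    \begin{bmatrix} \rho & r^T \\ 0 & B \end{bmatrix}
  = L_1 R_1
\tag{3.27}
$$

where B, $\ell$, r, and $\rho$ (with the obvious sizes) are defined as

$$\rho = \alpha \tag{3.28}$$
$$r = y \tag{3.29}$$
$$\ell = \left(\frac{1}{\rho}\right) x \tag{3.30}$$
$$B = G - \ell r^T \tag{3.31}$$

Thus, given $A = L_0 R_0$, Gauss factors $A = L_1 R_1$ and $P(0) \Rightarrow P(1)$.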
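
As a concrete check of this basis step, the sketch below applies the definitions (3.28) through (3.31) to the example matrix (3.17), taken without pivoting, and multiplies the two block factors back together. The function names `one_stage` and `reassemble` are hypothetical, introduced only for this illustration.

```python
from fractions import Fraction

def one_stage(A):
    """Partition A as in (3.26) and form rho, r, ell, and B."""
    alpha = Fraction(A[0][0])
    y = [Fraction(v) for v in A[0][1:]]
    x = [Fraction(row[0]) for row in A[1:]]
    G = [[Fraction(v) for v in row[1:]] for row in A[1:]]
    rho, r = alpha, y                                  # (3.28), (3.29)
    ell = [xi / rho for xi in x]                       # (3.30)
    B = [[G[i][j] - ell[i] * r[j] for j in range(len(r))]
         for i in range(len(ell))]                     # (3.31)
    return rho, r, ell, B

def reassemble(rho, r, ell, B):
    """Multiply the block factors [1 0; ell I] and [rho r^T; 0 B]."""
    top = [rho] + r
    rest = [[ell[i] * rho] + [ell[i] * r[j] + B[i][j] for j in range(len(r))]
            for i in range(len(ell))]
    return [top] + rest

A = [[2, 3, 4, 5], [4, 6, 8, 5], [2, 4, 7, 9], [6, 8, 8, 9]]
rho, r, ell, B = one_stage(A)
product = reassemble(rho, r, ell, B)
```

Here the multipliers are 2, 1, and 3, and B agrees with the lower right block obtained after the first stage of Gaussian elimination.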


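Inductive Step: Consider the matrices $L_k$ and $R_k$ that are submitted to stage
(k+1) of a Gauss factorization procedure. We make the inductive step to show that
$P(k) \Rightarrow P(k+1)$. For $0 \le k < n$, $A = L_k R_k$ may be partitioned so that

$$
A = \begin{bmatrix} L & 0 & 0 \\ m^T & 1 & 0^T \\ N & 0 & I \end{bmatrix}
    \begin{bmatrix} R & c & D \\ 0^T & \alpha & y^T \\ 0 & x & G \end{bmatrix}
  = L_k R_k
\tag{3.32}
$$

where $L \in \mathbb{R}^{k \times k}$ is a unit lower triangular matrix and $R \in \mathbb{R}^{k \times k}$ is a right (upper)
triangular matrix with nonzero diagonal elements.

The Gauss process forms $\rho$ as in (3.28), r as in (3.29), multipliers, $\ell$, as
in (3.30), and B as in (3.31). Then, for $0 \le k \le (n-1)$, GF forms

$$
A = \begin{bmatrix} L & 0 & 0 \\ m^T & 1 & 0^T \\ N & \ell & I \end{bmatrix}
    \begin{bmatrix} R & c & D \\ 0^T & \rho & r^T \\ 0 & 0 & B \end{bmatrix}
  = L_{k+1} R_{k+1}
\tag{3.33}
$$

Thus, for $0 \le k < n$, $P(k) \Rightarrow P(k+1)$. [Ref. 24]
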
Conclusion: The nonsingular matrix $A \in \mathbb{R}^{n \times n}$ can be factored, in n steps of the
Gauss factorization process, so that A = LR with L being unit lower triangular and
R being upper triangular with nonzero diagonal elements.

The proof has demonstrated the effect of GF. For simplicity, it excluded
the pivoting strategy (simply assuming that, at every stage, a pivot $\alpha \ne 0$ would be
available). It also held A square. In this sense the proof is somewhat specific. There
is a more general conclusion to be made. This conclusion holds for GF with pivoting
and $0 \ne A \in \mathbb{R}^{m \times n}$ and it is absolutely essential to understanding the factorization.


4. The LR Theorem 


With the GF process complete, and the vast majority of the work done,
we show how to form a solution from our factorization. Various methods of pivoting
(resulting in permutation vectors) and the method whereby A is factored have been
discussed. To solve the system, we must put all of this information together. The
key is the LR Theorem [Ref. 24]:


Theorem 3.1 (LR Theorem) Let $0 \ne A \in \mathbb{R}^{m \times n}$. Then there are permutation
matrices $P \in \mathbb{R}^{n \times n}$ and $Q \in \mathbb{R}^{m \times m}$, an integer $r \ge 1$, a lower trapezoidal matrix
$L \in \mathbb{R}^{m \times r}$ and an upper (right) trapezoidal matrix $R \in \mathbb{R}^{r \times n}$ so that $Q^T A P = LR$.
The diagonal elements of L satisfy $\lambda_{ii} = 1$ with $i = 1, 2, \ldots, r$ and the diagonal
elements of R satisfy $\rho_{ii} \ne 0$ for $i = 1, 2, \ldots, r$.


5. Filling in the Blanks


a. The Main Factors 


GF used the space of A to hold the two principal matrices, L and R,
in the factorization of A. To see them, we will extract the lower triangular matrix,
L, and upper (right) triangular matrix, R, from the final copy of A (3.23). Initially,
let L = R = 0. We form L by placing ones on its diagonal and filling the elements
below the diagonal from the corresponding locations in A.

$$
L = \begin{bmatrix} 1 & 0 & 0 & 0 \\ 5/9 & 1 & 0 & 0 \\ 1 & 9/37 & 1 & 0 \\ 5/9 & 1/37 & 15/61 & 1 \end{bmatrix}
\tag{3.34}
$$

R is formed with the diagonal elements (i.e., pivots) and upper triangle of A.

$$
R = \begin{bmatrix} 9 & 7 & 2 & 4 \\ 0 & 37/9 & 26/9 & 34/9 \\ 0 & 0 & 122/37 & 114/37 \\ 0 & 0 & 0 & -15/183 \end{bmatrix}
\tag{3.35}
$$
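
The extraction of the two factors from the packed matrix (3.23) can be sketched as below for the square case. The function name `split_factors` is an assumption made for this illustration.

```python
from fractions import Fraction as Fr

def split_factors(F):
    """Split a packed square factorization into unit lower L and upper R."""
    n = len(F)
    L = [[F[i][j] if j < i else Fr(int(i == j)) for j in range(n)]
         for i in range(n)]
    R = [[F[i][j] if j >= i else Fr(0) for j in range(n)]
         for i in range(n)]
    return L, R

# Final, factored copy of A from (3.23)
F = [[Fr(9), Fr(7), Fr(2), Fr(4)],
     [Fr(5, 9), Fr(37, 9), Fr(26, 9), Fr(34, 9)],
     [Fr(1), Fr(9, 37), Fr(122, 37), Fr(114, 37)],
     [Fr(5, 9), Fr(1, 37), Fr(15, 61), Fr(-15, 183)]]
L, R = split_factors(F)
```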


b. Permutation Matrices 


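The bookkeeping allows us to construct P and Q very quickly. To form
$P \in \mathbb{R}^{n \times n}$, we set every column, j, in P equal to the axis vector implied by $\pi_j$, the
$j$th element of p. This yields the permutation matrix, P, that will satisfy the LR
Theorem, namely

$$
p = \begin{bmatrix} \pi_1 \\ \pi_2 \\ \pi_3 \\ \pi_4 \end{bmatrix}
  = \begin{bmatrix} 4 \\ 3 \\ 1 \\ 2 \end{bmatrix}
\;\Rightarrow\;
P = \begin{bmatrix} e_4 & e_3 & e_1 & e_2 \end{bmatrix}
  = \begin{bmatrix} 0 & 0 & 1 & 0 \\ 0 & 0 & 0 & 1 \\ 0 & 1 & 0 & 0 \\ 1 & 0 & 0 & 0 \end{bmatrix}
\tag{3.36}
$$

Similarly, every column, j, in $Q \in \mathbb{R}^{m \times m}$ is set equal to the axis vector implied by
$\psi_j$, the $j$th element of q. For our example, we have

$$
q = \begin{bmatrix} \psi_1 \\ \psi_2 \\ \psi_3 \\ \psi_4 \end{bmatrix}
  = \begin{bmatrix} 3 \\ 2 \\ 4 \\ 1 \end{bmatrix}
\;\Rightarrow\;
Q = \begin{bmatrix} e_3 & e_2 & e_4 & e_1 \end{bmatrix}
  = \begin{bmatrix} 0 & 0 & 0 & 1 \\ 0 & 1 & 0 & 0 \\ 1 & 0 & 0 & 0 \\ 0 & 0 & 1 & 0 \end{bmatrix}
\tag{3.37}
$$


c. Check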


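Now we check to make sure that our solution satisfies the LR Theorem.
First, consider the product LR:

$$
LR = \begin{bmatrix} 1 & 0 & 0 & 0 \\ 5/9 & 1 & 0 & 0 \\ 1 & 9/37 & 1 & 0 \\ 5/9 & 1/37 & 15/61 & 1 \end{bmatrix}
     \begin{bmatrix} 9 & 7 & 2 & 4 \\ 0 & 37/9 & 26/9 & 34/9 \\ 0 & 0 & 122/37 & 114/37 \\ 0 & 0 & 0 & -15/183 \end{bmatrix}
\tag{3.38}
$$

$$
= \begin{bmatrix} 9 & 7 & 2 & 4 \\ 5 & 8 & 4 & 6 \\ 9 & 8 & 6 & 8 \\ 5 & 4 & 2 & 3 \end{bmatrix}
\tag{3.39}
$$

And

$$
Q^T A P = \begin{bmatrix} 0 & 0 & 1 & 0 \\ 0 & 1 & 0 & 0 \\ 0 & 0 & 0 & 1 \\ 1 & 0 & 0 & 0 \end{bmatrix}
          \begin{bmatrix} 2 & 3 & 4 & 5 \\ 4 & 6 & 8 & 5 \\ 2 & 4 & 7 & 9 \\ 6 & 8 & 8 & 9 \end{bmatrix}
          \begin{bmatrix} 0 & 0 & 1 & 0 \\ 0 & 0 & 0 & 1 \\ 0 & 1 & 0 & 0 \\ 1 & 0 & 0 & 0 \end{bmatrix}
\tag{3.40}
$$

$$
= (Q^T A) P = \begin{bmatrix} 2 & 4 & 7 & 9 \\ 4 & 6 & 8 & 5 \\ 6 & 8 & 8 & 9 \\ 2 & 3 & 4 & 5 \end{bmatrix}
              \begin{bmatrix} 0 & 0 & 1 & 0 \\ 0 & 0 & 0 & 1 \\ 0 & 1 & 0 & 0 \\ 1 & 0 & 0 & 0 \end{bmatrix}
\tag{3.41}
$$

$$
= \begin{bmatrix} 9 & 7 & 2 & 4 \\ 5 & 8 & 4 & 6 \\ 9 & 8 & 6 & 8 \\ 5 & 4 & 2 & 3 \end{bmatrix}
\tag{3.42}
$$


Our factorization satisfies $Q^T A P = LR$.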


d. Solution 


Now we solve the system. Recall that Gaussian elimination operated 
on the matrix, A, and the right-hand side, b, at the same time. The end result of 
GE is that A is reduced to upper triangular form by successive elimination of the 
lower triangle so that we could solve for u with a relatively easy back substitution. 

The strategy of Gauss factorization is different. First, b is not part of the factorization process. Secondly, even though we are changing A, we know that we can get it back at the end (if we want to), so there is no need to save the original A. Now, using the LR Theorem, we complete the solution. Recall that the original system was

\[ Au = b. \tag{3.43} \]


The factorization process constructs permutation matrices P and Q and transforms the original matrix A into a combined version of L and R. Further (by the LR Theorem) we know that these matrices satisfy

\[ Q^T A P = LR. \tag{3.44} \]

Now, by multiplying (3.44) through by Q from the left and $P^T$ on the right, we see that

\[ Q Q^T A P P^T = Q L R P^T. \tag{3.45} \]

Performing the cancellations on the left-hand side, we have

\[ A = Q L R P^T. \tag{3.46} \]

This is the factorization of A. Substituting this into (3.43) yields

\[ Q L R P^T u = b \tag{3.47} \]

or

\[ L R P^T u = Q^T b. \tag{3.48} \]

Now let $\tilde{b} = Q^T b$ and let $\tilde{u} = P^T u$. Then

\[ L R \tilde{u} = \tilde{b}. \tag{3.49} \]

Letting $c = R\tilde{u}$, this becomes

\[ L c = \tilde{b}. \tag{3.50} \]

Since we know L and $\tilde{b}$, we may solve for c by a simple forward substitution. Then, using c and knowing that $R\tilde{u} = c$, we perform a simple back substitution and determine $\tilde{u}$. Finally, by definition, $\tilde{u} = P^T u$ (i.e., $\tilde{u}$ is a mere permutation of u) so we can swap elements in $\tilde{u}$ to arrive at u using $P\tilde{u} = u$.


Let us summarize this lengthy process into the main steps. The GF process factors $A = Q L R P^T$, changing the general matrix into a product where the most significant factors are both triangular. This reduces the hard problem to two easy ones. It is designed so that we can solve for u in two steps:

• Solve, by forward substitution, the system $Lc = \tilde{b}$ for a vector, c, of unknowns.

• Solve, by back substitution, the system $R\tilde{u} = c$ for (a permutation of) the original unknowns, u.


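So, for our example, the first step is to solve

\[ \begin{bmatrix} 1 & 0 & 0 & 0 \\ 5/9 & 1 & 0 & 0 \\ 1 & 9/37 & 1 & 0 \\ 5/9 & 1/37 & 15/61 & 1 \end{bmatrix} \begin{bmatrix} c_1 \\ c_2 \\ c_3 \\ c_4 \end{bmatrix} = \begin{bmatrix} 13 \\ -5 \\ -17 \\ 0 \end{bmatrix} = \tilde{b}. \tag{3.51} \]

Forward substitution, applied to this system, yields

\[ c = \begin{bmatrix} c_1 \\ c_2 \\ c_3 \\ c_4 \end{bmatrix} = \begin{bmatrix} 13 \\ -110/9 \\ -1000/37 \\ -15/61 \end{bmatrix}. \tag{3.52} \]

Now we know c, so we can solve the second triangular system, $R\tilde{u} = c$, for $\tilde{u}$ by back substitution

\[ \begin{bmatrix} 9 & 7 & 2 & 4 \\ 0 & 37/9 & 26/9 & 34/9 \\ 0 & 0 & 122/37 & 114/37 \\ 0 & 0 & 0 & -15/183 \end{bmatrix} \tilde{u} = \begin{bmatrix} 13 \\ -110/9 \\ -1000/37 \\ -15/61 \end{bmatrix} \tag{3.53} \]

which yields

\[ \tilde{u} = \begin{bmatrix} 1 \\ 2 \\ -11 \\ 3 \end{bmatrix}. \tag{3.54} \]

Now it is easy to recover u. Since we have defined $\tilde{u} = P^T u$, we know that $P\tilde{u} = u$ (a simple rearrangement of the elements that we have already found). We apply P to $\tilde{u}$ and find that

\[ u = P\tilde{u} = \begin{bmatrix} 0 & 0 & 1 & 0 \\ 0 & 0 & 0 & 1 \\ 0 & 1 & 0 & 0 \\ 1 & 0 & 0 & 0 \end{bmatrix} \begin{bmatrix} 1 \\ 2 \\ -11 \\ 3 \end{bmatrix} = \begin{bmatrix} -11 \\ 3 \\ 2 \\ 1 \end{bmatrix}. \tag{3.55} \]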


Comparing this to earlier solutions, we find that GF has arrived at the same solution. 
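The two-step solve can be sketched in C. This is only an illustration, not the code of Appendix F: the function name gf_solve and the EX_ constants are hypothetical, the factors are hard-coded from the worked example, and the right-hand side b = (0, -5, 13, -17) is the one implied by Q^T b = (13, -5, -17, 0). The permutations are applied through the vectors p and q rather than by forming P and Q.

```c
#define N 4

/* Factors, permutation vectors, and right-hand side of the worked example. */
static const double EX_L[N][N] = {
    {1.0,       0.0,        0.0,         0.0},
    {5.0 / 9.0, 1.0,        0.0,         0.0},
    {1.0,       9.0 / 37.0, 1.0,         0.0},
    {5.0 / 9.0, 1.0 / 37.0, 15.0 / 61.0, 1.0}
};
static const double EX_R[N][N] = {
    {9.0, 7.0,        2.0,          4.0},
    {0.0, 37.0 / 9.0, 26.0 / 9.0,   34.0 / 9.0},
    {0.0, 0.0,        122.0 / 37.0, 114.0 / 37.0},
    {0.0, 0.0,        0.0,          -15.0 / 183.0}
};
static const int EX_P[N] = {4, 3, 1, 2};          /* 1-based, as in the text */
static const int EX_Q[N] = {3, 2, 4, 1};
static const double EX_B[N] = {0.0, -5.0, 13.0, -17.0};

/* Solve A u = b given Q^T A P = L R (hypothetical helper). */
void gf_solve(const double L[N][N], const double R[N][N],
              const int p[N], const int q[N],
              const double b[N], double u[N])
{
    double c[N], ut[N];
    int i, j;

    /* Forward substitution on L c = Q^T b, where (Q^T b)_i = b_{q_i}. */
    for (i = 0; i < N; i++) {
        c[i] = b[q[i] - 1];
        for (j = 0; j < i; j++)
            c[i] -= L[i][j] * c[j];
    }
    /* Back substitution on R ut = c. */
    for (i = N - 1; i >= 0; i--) {
        ut[i] = c[i];
        for (j = i + 1; j < N; j++)
            ut[i] -= R[i][j] * ut[j];
        ut[i] /= R[i][i];
    }
    /* Undo the column permutation: u = P ut, i.e. u_{p_j} = ut_j. */
    for (j = 0; j < N; j++)
        u[p[j] - 1] = ut[j];
}
```

Calling gf_solve(EX_L, EX_R, EX_P, EX_Q, EX_B, u) should reproduce the solution of the example up to rounding. Working with p and q directly avoids two matrix-vector products with matrices that are only permutations of the identity.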

In these examples, the notion of elimination was developed first. The GE process performs successive eliminations beneath its pivots and reduces A to triangular form, and then the solution is available in only $n^2$ flops. GF spends an almost identical amount of work in the reduction process, but the result is a factorization with L and R being the significant factors. (They are the only ones that are more than a permutation of the identity.) In the examples, we used pivoting because it was practical. Now let us take a closer look at the justifications for pivoting.
F. PIVOTING FOR SIZE


The issue of pivoting is a very interesting and important one. We concluded that 
we must pivot or face the possibility of attempting to divide by zero, an unacceptable 
option. To solve this problem, we may pick any nonzero element in A(k:m,k:n) 
and perform the column and row interchanges required to install it as the new pivot 
(k is the pivot index). There are many strategies that we could adopt. 

The logical question would be something like: “Given that we must pivot, what 
is the best means available?” But the answer is not so easy, and there are many 
trade-offs to be considered. We are faced with choosing along a spectrum, where 
speed lies at one end and accuracy lies at the other. For instance, we could begin a 
search and pick the first nonzero element in this area. Or, we could search for the 
row with the most nonzero elements (that had a nonzero element in the $k$th column).

The two most common strategies for pivoting are the partial and complete meth- 
ods, which we have discussed. We determined that partial pivoting would work per- 
fectly (with no error) if A was nonsingular and the storage and arithmetic could be 
handled with infinite precision. If infinite precision were available, we could stop 
right here. There would be no need to try to refine the method. In a finite-precision 
machine, however, we must deal with the issue of errors. 

To deal with errors, the problem must be stated more precisely. The errors that concern us would arise due to growth of the elements of L and/or R as we step through the stages of Gauss. In the end, partial pivoting guarantees that all of the elements of L will be, at most, unity in absolute value. This is easy to see. The pivoting strategy chooses each pivot to be the largest element (in absolute value) in column k at or below row k. This value is installed at A(k,k) and everything below the pivot is divided by the pivot, so no quotient can exceed one in absolute value.
Unfortunately, partial pivoting cannot make the same guarantee for the elements of R. It helps: the multipliers are less than or equal to one in absolute value. The elements of R are bounded by $2^{n-1}\alpha$, where $\alpha$ is the largest absolute value of the elements in A. This bound is not normally attained "in practice". [Ref. 23]

Growth is an indicator of trouble in this process. If we cannot control it completely, we should, at a minimum, monitor it. The growth factor, g(n), of a Gauss factorization process for $A \in \mathbb{R}^{n\times n}$ is defined as follows. Let $\alpha$ be the largest absolute value in the original matrix, A. Let $\beta$ be the largest absolute value that occurs in any Gauss transform, G, including the first one, G = A. Then $g(n) = \beta/\alpha$ gives a growth factor normalized by $\alpha$ (i.e., $g(n) \ge 1$).
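To make the definition concrete, the following C sketch eliminates with partial pivoting and records the largest absolute value seen in each Gauss transform area. The function name growth_factor_pp and the row-major storage are assumptions made for illustration; this is not the monitoring code of Appendix F.

```c
static double abs_val(double x) { return x < 0.0 ? -x : x; }

/* Largest absolute value in rows and columns k..n-1 of the n-by-n
 * row-major array a (the Gauss transform area at stage k). */
static double max_abs_from(const double *a, int n, int k)
{
    double m = 0.0;
    int i, j;
    for (i = k; i < n; i++)
        for (j = k; j < n; j++)
            if (abs_val(a[i * n + j]) > m)
                m = abs_val(a[i * n + j]);
    return m;
}

/* Eliminate with partial pivoting (a is overwritten) and return
 * g = beta / alpha. Returns 0 for the zero matrix. */
double growth_factor_pp(double *a, int n)
{
    double alpha = max_abs_from(a, n, 0);
    double beta = alpha, g, z;
    int i, j, k, s;

    if (alpha == 0.0)
        return 0.0;
    for (k = 0; k < n - 1; k++) {
        s = k;                               /* locate the pivot row */
        for (i = k + 1; i < n; i++)
            if (abs_val(a[i * n + k]) > abs_val(a[s * n + k]))
                s = i;
        if (a[s * n + k] == 0.0)
            continue;                        /* nothing to eliminate */
        for (j = 0; j < n; j++) {            /* swap rows s and k */
            z = a[k * n + j];
            a[k * n + j] = a[s * n + j];
            a[s * n + j] = z;
        }
        for (i = k + 1; i < n; i++) {        /* update the area below */
            z = a[i * n + k] / a[k * n + k];
            a[i * n + k] = z;
            for (j = k + 1; j < n; j++)
                a[i * n + j] -= z * a[k * n + j];
        }
        g = max_abs_from(a, n, k + 1);       /* next Gauss transform */
        if (g > beta)
            beta = g;
    }
    return beta / alpha;
}
```

For the 4-by-4 matrix with ones on the diagonal, ones in the last column, and -1 everywhere below the diagonal, the last column doubles at every stage and the function returns 8, which is the bound $2^{n-1}$ for n = 4. Note that the extra pass through the whole area at each stage is exactly the cost that partial pivoting otherwise avoids.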

A great deal of analysis has been done on this subject. Wilkinson showed that, with complete pivoting and real matrices, g(n) grows much more slowly than $2^n$. He conjectured that $g(n) \le n$. The latter has recently been disproved, with a counterexample by Nicholas Young. [Ref. 23]

As a practical matter, when one seeks to monitor growth one uses complete pivoting. To consider performance, one uses the partial pivoting strategy. The growth factor, g(n), is easy to monitor with a complete pivoting strategy since we are moving through the entire Gauss transform area at each stage anyway. For clarity, the pivoting algorithms and the Update algorithm are listed separately in this chapter. In real code (e.g., Appendix F), however, the pivot for stage (k+1) should be located during the update of G in stage k (to avoid unnecessary passes through the matrix). Monitoring growth would mean extra work in the partial pivoting algorithm, which needs to examine only the next pivot column. Since the primary reason for using partial pivoting is performance, it is counterproductive to monitor g(n) while using partial pivoting. A description of both pivoting policies, in algorithm form, follows.


Algorithm 3.1 (Partial Column Pivoting for Size) Given the matrix of coefficients, $A \in \mathbb{R}^{m\times n}$; a permutation vector, $q \in \mathbb{R}^m$; and an index, k, indicating the pivot column, this algorithm performs partial pivoting. First, the pivot element is located at A(s,k) with $s \ge k$. Once the pivot has been located, rows s and k are swapped to install the new pivot. Additionally, elements in q, indexed by s and k, are swapped to record the row interchanges.


begin PP
    s = k;
    for i = (k+1):m
        if (|A(i,k)| > |A(s,k)|)
            s = i;
        end if
    end for
    if (s ≠ k)
        for j = 1:n
            z = A(k,j);
            A(k,j) = A(s,j);
            A(s,j) = z;
        end for
        z = q(k);  q(k) = q(s);  q(s) = z;
    end if
end PP
Algorithm 3.2 (Complete Pivoting for Size) Given the matrix of coefficients, $A \in \mathbb{R}^{m\times n}$; permutation vectors, $p \in \mathbb{R}^n$ and $q \in \mathbb{R}^m$; and an index, k, indicating the pivot row and column, this algorithm performs complete pivoting. First, the pivot element is located at A(s,t). Once the pivot has been located, rows s and k and columns t and k are swapped to install the new pivot. The permutation vectors are updated accordingly.


begin PC
    s = k;
    t = k;
    for i = k:m                               (locate the pivot)
        for j = k:n
            if (|A(i,j)| > |A(s,t)|)
                s = i;
                t = j;
            end if
        end for
    end for
    if (s ≠ k)                                (row interchanges)
        for j = 1:n
            z = A(k,j);  A(k,j) = A(s,j);  A(s,j) = z;
        end for
        z = q(k);  q(k) = q(s);  q(s) = z;
    end if
    if (t ≠ k)                                (column interchanges)
        for i = 1:m
            z = A(i,k);  A(i,k) = A(i,t);  A(i,t) = z;
        end for
        z = p(k);  p(k) = p(t);  p(t) = z;
    end if
end PC
G. SEQUENTIAL ALGORITHMS


The examples considered have described the Gauss process. We first considered elimination (GE) and then a factorization method (GF). Both methods require work of the same order, so the latter, yielding a factorization of A, is much preferred. Algorithms for the GF process are described below. The arithmetic in the Gauss transform area, G, is performed the same (regardless of pivoting strategy) so a separate algorithm is given for updating G. The algorithms GFPP (pivoting, partial) and GFPC (pivoting, complete) are given following the updating algorithm. These algorithms are adapted from Gragg [Ref. 23].


Algorithm 3.3 (Update Gauss Transform Area) Given the matrix of coefficients, $A \in \mathbb{R}^{m\times n}$; and k, the pivot column, this algorithm performs the appropriate arithmetic throughout the pivot column and Gauss transform area, G, of A.


begin Update
    z = A(k,k);                               (z is the pivot value)
    for i = (k+1):m                           (pivot column division)
        A(i,k) = A(i,k)/z;
    end for
    for i = (k+1):m                           (arithmetic in G)
        z = A(i,k);                           (now z is the multiplier)
        for j = (k+1):n
            A(i,j) = A(i,j) - z × A(k,j);
        end for
    end for
end Update
Algorithm 3.4 (Gauss Factorization with Partial Pivoting) Given the matrix of coefficients, $A \in \mathbb{R}^{n\times n}$, this algorithm modifies (overwrites) A with a unit lower triangular matrix (with an implicit diagonal), $L \in \mathbb{R}^{n\times n}$, and an upper (right) triangular matrix, $R \in \mathbb{R}^{n\times n}$, having nonzero diagonal elements (the pivots). The process also forms the row permutation vector, q, and the corresponding permutation matrix, $Q \in \mathbb{R}^{n\times n}$, that results from partial column pivoting for size. The algorithm gives the factorization: $Q^T A = LR$.


begin GFPP
    n = order(A)
    Q = zeros(n,n)
    for j = 1:n
        q(j) = j;                             (initialize q)
    end for
    for k = 1:n                               (the Gauss process)
        PP(A,q,k)                             (pivoting)
        if (A(k,k) = 0)
            print "A is singular!"
            exit
        end if
        Update(A,k)                           (Update G)
    end for
    for j = 1:n
        Q(q(j),j) = 1.0;                      (form Q)
    end for
end GFPP
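A compact C rendering of GFPP may help to fix ideas. It is a sketch under stated assumptions rather than the program of Appendix F: the names gfpp and lr_entry are hypothetical, A is stored row-major in a flat array, q holds 0-based row indices, and Q is never formed explicitly. The pivoting and update steps are written inline.

```c
static double abs_val(double x) { return x < 0.0 ? -x : x; }

/* Factor the n-by-n row-major array a in place so that Q^T A = L R.
 * On return, a holds R on and above the diagonal and the multipliers of
 * L below it; q[i] is the (0-based) original row now in position i.
 * Returns 0 on success, -1 if a zero pivot is met (A is singular). */
int gfpp(double *a, int n, int *q)
{
    int i, j, k, s;
    double z;

    for (i = 0; i < n; i++)
        q[i] = i;
    for (k = 0; k < n; k++) {
        s = k;                                   /* PP: locate the pivot */
        for (i = k + 1; i < n; i++)
            if (abs_val(a[i * n + k]) > abs_val(a[s * n + k]))
                s = i;
        if (s != k) {                            /* install it in row k */
            for (j = 0; j < n; j++) {
                z = a[k * n + j];
                a[k * n + j] = a[s * n + j];
                a[s * n + j] = z;
            }
            i = q[k]; q[k] = q[s]; q[s] = i;
        }
        if (a[k * n + k] == 0.0)
            return -1;
        for (i = k + 1; i < n; i++) {            /* Update */
            z = a[i * n + k] / a[k * n + k];
            a[i * n + k] = z;
            for (j = k + 1; j < n; j++)
                a[i * n + j] -= z * a[k * n + j];
        }
    }
    return 0;
}

/* Entry (i, j) of the product L R, read from the packed factors. */
double lr_entry(const double *a, int n, int i, int j)
{
    int k, kmax = i < j ? i : j;
    double sum = 0.0;
    for (k = 0; k <= kmax; k++)
        sum += (k == i ? 1.0 : a[i * n + k]) * a[k * n + j];
    return sum;
}
```

Since row i of $Q^T A$ is row q[i] of the original matrix, the factorization can be checked by comparing lr_entry(a, n, i, j) with the saved entry of A in row q[i] and column j.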
Algorithm 3.5 (Gauss Factorization with Complete Pivoting) Given a matrix of coefficients, $A \in \mathbb{R}^{m\times n}$, the following algorithm modifies (overwrites) A with a unit lower trapezoidal matrix (with implicit diagonal), $L \in \mathbb{R}^{m\times r}$, and an upper (right) trapezoidal matrix, $R \in \mathbb{R}^{r\times n}$. The diagonal elements of R are nonzero (pivots). The process forms permutation matrices, $P \in \mathbb{R}^{n\times n}$ and $Q \in \mathbb{R}^{m\times m}$, to reflect the complete pivoting for size. These matrices are formed to satisfy the LR Theorem: $Q^T A P = LR$.


begin GFPC
    m = rows(A);  n = cols(A);                (initialization)
    P = zeros(n,n);  Q = zeros(m,m);
    for j = 1:n
        p(j) = j;
    end for
    for i = 1:m
        q(i) = i;
    end for
    for k = 1:n                               (the Gauss process)
        PC(A,p,q,k)                           (pivoting)
        if (A(k,k) = 0)
            print "A is singular!"
            exit
        end if
        Update(A,k)                           (Update G)
    end for
    for j = 1:n
        P(p(j),j) = 1.0;                      (form P)
    end for
    for j = 1:m
        Q(q(j),j) = 1.0;                      (form Q)
    end for
end GFPC
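The same process with complete pivoting can be sketched in C as follows. Again this is an illustration, not the Appendix F code: the name gfpc is hypothetical, storage is row-major, the stage loop is assumed to run over min(m, n) pivots, and p and q keep the 1-based values used in the text so that they can be compared with (3.36) and (3.37).

```c
static double abs_val(double x) { return x < 0.0 ? -x : x; }

/* Factor the m-by-n row-major array a in place so that Q^T A P = L R.
 * p (length n) and q (length m) receive 1-based permutation vectors.
 * Returns 0 on success, -1 if a zero pivot is met. */
int gfpc(double *a, int m, int n, int *p, int *q)
{
    int i, j, k, s, t, tmp;
    int r = m < n ? m : n;
    double z;

    for (j = 0; j < n; j++) p[j] = j + 1;
    for (i = 0; i < m; i++) q[i] = i + 1;

    for (k = 0; k < r; k++) {
        s = k; t = k;                            /* PC: locate the pivot */
        for (i = k; i < m; i++)
            for (j = k; j < n; j++)
                if (abs_val(a[i * n + j]) > abs_val(a[s * n + t])) {
                    s = i;
                    t = j;
                }
        if (s != k) {                            /* row interchanges */
            for (j = 0; j < n; j++) {
                z = a[k * n + j];
                a[k * n + j] = a[s * n + j];
                a[s * n + j] = z;
            }
            tmp = q[k]; q[k] = q[s]; q[s] = tmp;
        }
        if (t != k) {                            /* column interchanges */
            for (i = 0; i < m; i++) {
                z = a[i * n + k];
                a[i * n + k] = a[i * n + t];
                a[i * n + t] = z;
            }
            tmp = p[k]; p[k] = p[t]; p[t] = tmp;
        }
        if (a[k * n + k] == 0.0)
            return -1;
        for (i = k + 1; i < m; i++) {            /* Update */
            z = a[i * n + k] / a[k * n + k];
            a[i * n + k] = z;
            for (j = k + 1; j < n; j++)
                a[i * n + j] -= z * a[k * n + j];
        }
    }
    return 0;
}
```

Applied to the 4-by-4 matrix A of the worked example, with rows (2 3 4 5), (4 6 8 5), (2 4 7 9), and (6 8 8 9), this sketch should return p = (4, 3, 1, 2) and q = (3, 2, 4, 1) and leave the packed L and R of the example in a. Because the comparison is strict, ties between equally large candidates go to the first one met in row order, which is what selects the 9 in row 3 rather than the 9 in row 4 at the first stage.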
H. CONJUGATE GRADIENTS


Time permits only a brief synopsis of the method of conjugate gradients (CG). This method was described by Magnus R. Hestenes and Eduard Stiefel [Ref. 18]. CG possesses some very nice characteristics and it is quite different from the Gauss method. Once again, we begin with a system of linear equations

\[ Au = b. \tag{3.56} \]

The algorithm given by Hestenes and Stiefel is designed for $A \in \mathbb{R}^{n\times n}$ symmetric and positive definite (Appendix A). Let $s \in \mathbb{R}^n$ be the vector that would solve (3.56) exactly, so that As = b. Let $u_i \in \mathbb{R}^n$ be the estimate of the solution, s, produced in the $i$th iteration. The original estimate, $u_0$, is merely a guess (it may be a good guess). For instance, in the absence of better information, we could choose $u_0$ to be the vector of all zeros or all ones.

The CG process takes our initial guess and develops a (guaranteed) better estimate for the next stage. To measure the progress, we could use the residual vector

\[ r_i = b - A u_i \tag{3.57} \]

but Hestenes and Stiefel warn that its Euclidean norm, $\| r_i \|_2$, may actually increase in every step but the last! A more reliable measure, called the error vector

\[ e_i = s - u_i \tag{3.58} \]

has monotonically decreasing length. After n iterations of the CG process, we are guaranteed to have a very good estimate $u_n$ of s. In fact, if no rounding errors occur, we have $u_n = s$. In practice, CG can find a very good estimate, $u_m$, of s in m iterations, with m < n. The process "terminates in at most n steps if no rounding-off errors are encountered." [Ref. 18: p. 410]
The algorithm below is adapted from Hestenes and Stiefel [Ref. 18]. Before considering the algorithm, however, we should define the key term, conjugate. For A symmetric, two vectors $x \in \mathbb{R}^n$ and $y \in \mathbb{R}^n$ are said to be A-orthogonal (or conjugate) if the relation $x^T A y = (Ax)^T y = 0$ holds [Ref. 18: p. 410]. This is an extension of vector orthogonality, $x^T y = 0$. The algorithm given below is very simple. The iteration blindly proceeds from i = 0 to i = n. A more sophisticated (finite precision) scheme would set a tolerance (notion of "good enough") and stop (exit the loop) when this criterion was satisfied.


Algorithm 3.6 (The Method of Conjugate Gradients) Given the symmetric, positive definite matrix of coefficients, $A \in \mathbb{R}^{n\times n}$; and an initial guess, $u_0$, for the solution, s, of the system Au = b, this algorithm (in the absence of rounding-off errors) finds $u_i = s$ in i iterations ($i \le n$). The algorithm keeps track of a residual vector, $r_i$, and direction vectors, $p_i$. The residuals, $r_i$, are mutually orthogonal and the direction vectors, $p_i$, are mutually conjugate (A-orthogonal).


begin CG
    u_0 = zeros(n)                            (arbitrary initial guess)
    p_0 = r_0 = b - A u_0
    for i = 0:n
        δ = (p_i, A p_i)                      (denominator used below)
        α_i = (p_i, r_i)/δ                    (scalar multiplier used below)
        u_{i+1} = u_i + α_i p_i               (estimate of solution)
        r_{i+1} = r_i - α_i A p_i             (residual vector)
        β_i = -(r_{i+1}, A p_i)/δ
        p_{i+1} = r_{i+1} + β_i p_i           (direction vector)
    end for
end CG
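The CG iteration translates almost line for line into C. The sketch below is illustrative only: the name cg_solve, the dense row-major storage, and the tolerance argument are assumptions, and the early exit on a small residual is the "more sophisticated" stopping rule mentioned above rather than part of Algorithm 3.6. It uses the formulas with the common denominator $(p_i, A p_i)$, so the step is skipped once the residual has already vanished.

```c
#include <stdlib.h>

/* Inner product (x, y). */
static double dot(const double *x, const double *y, int n)
{
    double s = 0.0;
    int i;
    for (i = 0; i < n; i++)
        s += x[i] * y[i];
    return s;
}

/* y = A x for a dense n-by-n row-major matrix. */
static void matvec(const double *a, const double *x, double *y, int n)
{
    int i, j;
    for (i = 0; i < n; i++) {
        y[i] = 0.0;
        for (j = 0; j < n; j++)
            y[i] += a[i * n + j] * x[j];
    }
}

/* Conjugate gradients for symmetric positive definite A.
 * u holds the initial guess on entry and the estimate on return.
 * Returns the number of iterations performed, or -1 if memory
 * could not be allocated. */
int cg_solve(const double *a, const double *b, double *u, int n, double tol)
{
    double *r = malloc((size_t)n * sizeof *r);
    double *p = malloc((size_t)n * sizeof *p);
    double *ap = malloc((size_t)n * sizeof *ap);
    double delta, alpha, beta;
    int i, it = -1;

    if (r != NULL && p != NULL && ap != NULL) {
        matvec(a, u, ap, n);
        for (i = 0; i < n; i++)
            p[i] = r[i] = b[i] - ap[i];          /* p_0 = r_0 = b - A u_0 */
        for (it = 0; it < n; it++) {
            if (dot(r, r, n) <= tol * tol)       /* good enough: stop */
                break;
            matvec(a, p, ap, n);
            delta = dot(p, ap, n);
            alpha = dot(p, r, n) / delta;
            for (i = 0; i < n; i++) {
                u[i] += alpha * p[i];            /* estimate of solution */
                r[i] -= alpha * ap[i];           /* residual vector */
            }
            beta = -dot(r, ap, n) / delta;
            for (i = 0; i < n; i++)
                p[i] = r[i] + beta * p[i];       /* direction vector */
        }
    }
    free(r);
    free(p);
    free(ap);
    return it;
}
```

As a small check, for the 2-by-2 system with rows (4 1) and (1 3) and right-hand side (1, 2), starting from the zero vector, two iterations should give u close to (1/11, 7/11). The only matrix operation is the product A p, which is one reason the method is attractive for parallel implementation.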
I. SUMMARY


This chapter develops the Gaussian elimination process, the Gauss factorization process, pivoting strategies, and (briefly) the method of conjugate gradients. Each of the corresponding algorithms possesses potential for parallel solution. A parallel implementation of GF appears in the following chapter. Both partial and complete pivoting are pursued, with further discussion on their implications in a parallel environment.
IV. PARALLEL DESIGN


Nature is pleased with simplicity, and affects not the pomp of superfluous 
causes. 


— SIR ISAAC NEWTON (1642-1727) 


Sequential algorithms for Gauss factorization (GF) and the method of conjugate 
gradients (CG) are established in Chapter III. The goal of this chapter is to show 
parallel algorithms for Gauss factorization. The C programs that implement these 
algorithms are discussed in Chapter V and listed in Appendix F. 

Parallel algorithm design is a process that includes many considerations. The question of how to achieve parallelism is largely an art and is not discussed here. The method used in this research is often called a workfarm approach because the algorithm farms out work to processors. Equivalently, it may be called a manager-worker model. When we distribute the problem across many processors in a workfarm style, there are quite a number of issues that warrant careful consideration. The concerns associated with programming a parallel machine, even with a relatively simple model such as this, could occupy volumes.

Communications, load balancing, granularity, and other considerations abound. Metrics like speedup and efficiency should be used to lend credibility to the parallel nature of the algorithm. Additionally, we should consider the usual issues of maintainability, readability, portability, and other traits commonly associated with good (sequential) programming practice. Parallel codes must be clear combinations of sequential codes that are joined together in a logical manner. Simplicity should hold a place of great esteem in a parallel algorithm. The rest of this chapter introduces the issues of parallel design, particularly as they pertain to Gauss factorization.


A. INTERPROCESSOR COMMUNICATIONS 


Interprocessor communication is one of the most fundamental issues in parallel processing and, quite possibly, the most involved. Without a means of communicating (in a message-passing environment), the multiprocessor system is meaningless. The implications of any communications scheme are many and the interactions can be quite complex. Exhaustive coverage of this issue is out of the question, so we will consider a few of the most essential ideas.

1. The Network 


A network is the part of a multiprocessor system’s hardware that bears 
the interprocessor communications burden. It is a combination of nodes and links 
that connect those nodes, and it is the foundation upon which all communications 
must build. We will also refer to the nodes of a multiprocessor—using somewhat 
loose terminology—as processors. The term node is a more general term. Nodes 
are typically more sophisticated than a simple central processing unit (CPU) or, for 
that matter, any other sort of processor. The link is a wire that connects two nodes. 
An interconnection topology describes the pattern of links used to connect the nodes 
of a network. The network can be drawn or illustrated so that we can see how its 
nodes are connected. Appendix C discusses interconnection topologies and it gives 
a description (and illustrations) of the particular scheme used in this research: the 
hypercube. 

Intel combines an 80386 CPU with an 80387 math coprocessor and communications facilities to form a "CX" node for the iPSC/2 that was used in this research. INMOS provides the same general capabilities but packages it all on a (very sophisticated) single chip, called a transputer. Figure 4.1, from INMOS' T9000 Transputer Products Overview Manual [Ref. 25: p. 31], shows a high-level block diagram of the components of a T9000 transputer. Thus, any node of a message-passing multiprocessor system can be thought of as a combination of computing and communications facilities. It may possess other capabilities as well.

2. Message Routing 


The machines used in this research exhibit different message transmission 
schemes. The transputer system employs high-speed (20 megabits per second) point- 
to-point serial communications and store-and-forward message passing. That is, for 
multi-hop communications, each node along the way must receive the message, store 
it in local memory temporarily, and then pass it to the next node in the route. 

The Intel iPSC/2 uses another technique, called circuit switching or direct- 
connect communications. This approach is much like our telephone system. First, 
the originator of the message sends a small message containing information about 
the message (e.g., destination node number, length of message) to the destination 
via the nodes in-between. As this small header packet makes its way to the destina- 
tion the nodes along the way flip switches, closing a circuit from the sender to the 
receiver. Once this circuit is established, the message proceeds from the sender to 
the destination without interruption. 

Each method has its advantages and disadvantages. The circuit switching approach allows for fewer interruptions along the way, but it ties up the entire path for the duration of the communication. The store-and-forward method imposes delays for storing the message into, and then retrieving it from, the memory of every node along the way. (A more complete description of these two techniques, together with experimental results, is given in Appendix B.) For the algorithms employed in this research, almost all communications were "nearest neighbor" in the hypercube. In this case, the difference between the two approaches to message routing is insignificant and the nearest neighbor performance becomes more important.
3. Concurrent Computing and Communicating


The nodes of a multiprocessor machine should be able to both compute and communicate efficiently and concurrently. This is no small undertaking. The computing side must access memory to accomplish its mission, but the message passing begins by drawing data out of memory and ends by storing data into memory. Therefore, at a minimum, we have competition related to memory accesses. Furthermore, the computing and communication must be synchronized to some extent. The algorithms used in this research used blocking communications (described in Appendix E), which enforce synchronization.

There are overheads associated with communications and this synchronization problem. Bryant showed how transputers perform under various communication loads [Ref. 26] and this is mentioned in Appendix E. The issue of overheads is one that Charles Seitz considered for the "Cosmic Cube." Much, but not all, of the overhead is communication-related. Seitz listed three of the major problems [Ref. 27: p. 28]:


(1) the idle time that results from imperfect load balancing, (2) the waiting time caused by communications latencies in the channels and in the message forwarding, and (3) the processor time dedicated to processing and forwarding messages, a consideration that can be effectively eliminated by architectural improvements in the nodes.


Included in these costs, we should also recognize that some amount of time is required for the processor to perform "context switching" (changing jobs) and/or coordination with a special-purpose processor that we might call the communications manager.
Although the issue of concurrent communication and computing is a very complex one, we may consider significant issues that are related to the efficiency of communications and the effect upon the processor. Geoffrey Fox presents the notion of comparing communications ability to processing ability [Ref. 28: pp. 50-51]. Let $t_{calc}$ be "the typical time required to perform a generic calculation. For scientific problems, this can be taken as a floating-point calculation $a = b \times c$ or $a = b + c$." Furthermore, let $t_{comm}$ be "the typical time taken to communicate a single word between two nodes connected in the hardware topology." Then the ratio

\[ \frac{t_{comm}}{t_{calc}} \]

is a general characteristic of a particular system that can be quite useful in comparing machines. Fox uses this ratio in much of the rest of his work.

A parallel machine must necessarily possess a capable communications subsystem, but this is not enough. The program should also make prudent use of the communications facilities. This means that the programmer and/or compiler must exhibit a good understanding of the machine's communications abilities and weaknesses. Some characteristics are nearly universal. Most machines, for instance, reward the use of long messages because there is an overhead to sending any message, and in many cases it is nearly independent of message length. Other characteristics are very much machine-dependent. This means that the programmer should be relatively familiar with the communications abilities and characteristics of the target machine.

4. Accessing the Clock 


The ability to accurately measure the time required by communications and computations, preferably at the host and every node in the system, is absolutely essential in a multiprocessor environment. Profiling, in a sequential program, allows us to compare the time required by various parts of a program. Timing in a parallel environment allows us to profile the code. Thus we can determine the time required for instructions, loops, functions, or communications.

Profiling is an even more important practice for parallel coding than it is in the sequential case. The only way for a parallel program to be useful is if it can be implemented efficiently upon an acceptable number of processors. That is, in general, the only object in choosing a multiprocessor system over a sequential machine is the speed with which computation can be performed. One of the best tools available to the parallel programmer is the ability to see where and how much time is being spent.

At a minimum, we need the ability to sample a clock with reasonable precision. Both machines and compilers used in this research provide this capability (see timing.h in Appendix F for details). The transputers offer a choice of frequencies: the clock associated with low priority processes has a period of 64 microseconds and the high priority clock offers one microsecond ticks. The iPSC/2 mclock() function gives time in milliseconds.
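The stopwatch idea can be sketched with the standard C library alone. The code below is not timing.h from Appendix F, and it does not call mclock() or the transputer timers; it substitutes the ISO C clock() function, which reports processor time rather than wall-clock time. The function names are hypothetical.

```c
#include <time.h>

/* Elapsed processor time, in milliseconds, between two clock() samples. */
double elapsed_ms(clock_t start, clock_t stop)
{
    return 1000.0 * (double)(stop - start) / (double)CLOCKS_PER_SEC;
}

/* Convert a tick count to milliseconds, given the tick period in
 * microseconds (e.g., 64.0 or 1.0 for the two transputer clocks). */
double ticks_to_ms(unsigned long ticks, double period_us)
{
    return (double)ticks * period_us / 1000.0;
}

/* Time one call of work(arg), in milliseconds. */
double time_call_ms(void (*work)(void *), void *arg)
{
    clock_t start = clock();
    work(arg);
    return elapsed_ms(start, clock());
}
```

Wrapping a loop body or a communication call in time_call_ms gives the kind of profile discussed above. Converting every measurement to one unit makes readings from clocks with different periods directly comparable, although the coarser clock still limits what can be resolved.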


B. METRICS FOR PARALLEL COMPUTING 


1. Complexity 


Perhaps the most obvious measures for a parallel algorithm are simply those that we use for sequential algorithms. We want to keep time and storage requirements to a minimum. Perhaps the major difference in complexity analysis for a parallel algorithm is that we are primarily interested in a per-processor notion of complexity. If the problem has been farmed out in a fair manner, complexity analysis for the parallel case is merely an extension of the sequential case.

Consider the matrix $A \in \mathbb{R}^{n\times n}$. Suppose that its elements are 8-byte, double-precision, floating-point values (type double in C). Let $M_p$ denote the memory (in bytes) required at each processor to store A on p processors and let $T_p$ denote the time required for p processors to solve the system characterized by A. Then $M_1 = 8n^2$ bytes of storage, but (ideally) $M_8 = n^2$. When the problem is distributed across p processors simultaneously, the processors can share the storage burden.

Exceptions abound. For certain problems, it may actually be convenient (faster or more reliable) to store the entire matrix at each processor. Nevertheless, in most cases we would like to minimize local memory requirements. The Gauss factorization algorithm considered near the end of this chapter is no exception. Indeed, the transputers used in this work had only 32 kilobytes of storage each and the results of Chapter VI for transputers show how this can dictate the size of the problem that can be executed. The concepts of time and storage complexity have been developed in detail for sequential algorithms and they seem to hold a place in parallel algorithm assessment as well. We consider other measures that have been developed for parallel computing in the following section.
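A short calculation shows how local memory bounds the problem size. The C sketch below assumes the ideal case, in which the matrix is shared evenly and nothing else (code, workspace, message buffers) competes for memory; the function names are hypothetical.

```c
/* Bytes each of p processors holds when an n-by-n matrix of 8-byte
 * doubles is shared evenly: M_p = 8 n^2 / p. */
double bytes_per_node(long n, long p)
{
    return 8.0 * (double)n * (double)n / (double)p;
}

/* Largest order n for which the share of one node fits in local_bytes. */
long max_order(long local_bytes, long p)
{
    long n = 0;
    while (bytes_per_node(n + 1, p) <= (double)local_bytes)
        n++;
    return n;
}
```

With 32 kilobytes (32768 bytes) per node, this idealized bound gives n = 64 on one processor and n = 181 on eight. The usable size in practice is smaller, since the program itself must also fit.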
2. Contemporary Measures 


The concepts of speedup and efficiency (Appendix A) are two of the most common performance measures currently associated with parallel computing, with the ideal case (100% efficiency) yielding $t_p = t_1/P$ on a P-processor system. Selim Akl proposes the following criteria for analyzing algorithms [Ref. 29: pp. 21-28]:


• Running Time: Running time t(n) is the time required to execute an algorithm for a problem of input size n. Akl lists three ways to express this notion. First, we may count the steps in an algorithm. Akl distinguishes between computational steps (i.e., something like flops) and routing steps that are associated with interprocessor communication. Second, we have lower and upper bounds (e.g., the complexity notation presented in Appendix A). Finally, we have speedup. Akl gives the usual definition of speedup but clarifies it somewhat (details below).

• Number of Processors: Second in importance, Akl considers the number of processors required by an algorithm. He uses p(n) to denote the number of processors required for a problem of size n.

• Cost: Akl defines the cost, c(n), for a parallel algorithm as the product of the first two factors. That is, c(n) = t(n) × p(n).

• Other Measures: In this category, we have no less than three other qualities of a parallel system that deserve consideration. The area (i.e., chip real estate) required by the processors is significant. The length of the links figures in, as does any pattern in their layout (regularity and modularity). And finally, the period between processing different elements of an input is important.


Apparently metrics for parallel computing are still developing. There are several very useful concepts such as speedup and efficiency. The definition of speedup, at a first glance, is rather standard. It doesn't take much probing, however, to find that different authors make different assumptions. Akl defines speedup S in the usual manner,

\[ S = \frac{t_1}{t_p} \tag{4.1} \]

except that he is somewhat more specific about the times. He defines $t_1$ as the "worst-case running time of fastest known sequential algorithm for problem" and $t_p$ as "worst-case running time of parallel algorithm." [Ref. 29: p. 24] He has been more specific than most authors, but it seems likely that the algorithms, method of obtaining times $t_1$ and $t_p$, and systems should also be specified. Speedup is defined loosely in most cases. A parameterization to accompany speedup would be tedious, but useful. Until speedup becomes a standard term with accepted meaning, we shall have to specify exactly what it means. We should be more careful with this term.
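These measures are simple enough to state as code. The C sketch below assumes the common definition of efficiency as speedup divided by the number of processors (the thesis defines it in Appendix A, which is not reproduced here), together with speedup as in (4.1) and Akl's cost. The function names are hypothetical, and the caller must still say how the two times were obtained.

```c
/* Speedup S = t1 / tp, with t1 the sequential and tp the parallel time. */
double speedup(double t1, double tp)
{
    return t1 / tp;
}

/* Efficiency E = S / P; E = 1 corresponds to tp = t1 / P. */
double efficiency(double t1, double tp, int nproc)
{
    return t1 / (tp * (double)nproc);
}

/* Akl's cost: running time multiplied by the number of processors. */
double cost(double tp, int nproc)
{
    return tp * (double)nproc;
}
```

For instance, a job that takes 80 seconds sequentially and 12.5 seconds on eight processors has a speedup of 6.4, an efficiency of 0.8, and a cost of 100 processor-seconds.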
3. Other Ideas


Akl has appropriately distinguished between computational steps and rout- 
ing steps. The term floating-point operations (flops) has become quite popular (along 
with benchmarks) and this is a useful means of expressing the computational ability 
of a machine (for floating-point applications). The notion of routing, however, is 
somewhat vague. Nevertheless, this idea must be addressed. It should probably 
become more specific as we talk about similar machines. 

The machines used for this research were MIMD message-passing systems.
We can get much more specific about “routing steps” for such a machine. First, using
the clock as a stopwatch, we can profile any segment of code (including calculations
and/or communications). An implementation-specific version of Fox’s $t_{comm}/t_{calc}$
ratio can be instructive. It is important to apply this ratio to the hardware as Fox
defines it, but it is equally important to recognize the role of the software (algorithm).
That is, for some specific implementation, we should find some measure of how much
time is spent communicating and how much time is spent computing. More
specifically, a careful profile of a program could be made in the following manner.

The ratio of cumulative (i.e., over the execution of the entire program) time
spent communicating to time spent computing should be considered as a first cut,
especially if performance (efficiency) is weak. Algorithms such as Gauss factorization
are executed in stages, within a loop of some sort. In this case, the $t_{comm}/t_{calc}$
ratio per iteration is an interesting figure (and, if the loop represents most of the
program’s execution time, it should be approximately equal to the cumulative
figure).
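
The bookkeeping described above can be sketched in a few lines of C. This is only
an illustration: the names profile_t, profile_add(), seconds_since(), and
comm_calc_ratio() are hypothetical and do not come from Appendix F, and the standard
clock() call stands in for whatever timer a particular machine provides.

```c
#include <assert.h>
#include <time.h>

/* Accumulated totals for one run (hypothetical layout, not from Appendix F). */
typedef struct {
    double t_comm;  /* seconds spent in communication calls */
    double t_calc;  /* seconds spent in arithmetic */
} profile_t;

/* Seconds elapsed since an earlier clock() reading. */
double seconds_since(clock_t start)
{
    return (double)(clock() - start) / (double)CLOCKS_PER_SEC;
}

/* Add one measured segment (e.g., one stage of the main loop) to the totals. */
void profile_add(profile_t *p, double comm, double calc)
{
    assert(comm >= 0.0 && calc >= 0.0);
    p->t_comm += comm;
    p->t_calc += calc;
}

/* The ratio t_comm / t_calc; smaller values mean less communication overhead. */
double comm_calc_ratio(const profile_t *p)
{
    assert(p->t_calc > 0.0);
    return p->t_comm / p->t_calc;
}
```

Calling profile_add() once per stage gives the cumulative ratio at the end of the
run; applying comm_calc_ratio() to a single stage's measurements gives the
per-iteration figure.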

When possible, communications complexity should be analyzed carefully. For
instance, in the Gauss factorization code that is presented in Appendix F, a C
structure is used to relay the owner (node id) of a pivot and the pivot’s row,
column, and value. This structure holds 20 bytes of data, and we know the pattern
with which these structures are moved about during the course of the program. It
is important to quantify communication like this when possible. Vague notation
should lose significance in the presence of such concrete information.
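
As a concrete illustration of the 20-byte figure, consider the following sketch. The
field names and the breakdown into three 4-byte integers and one 8-byte double are
assumptions made here; the text states only what the structure carries and its total
size. On many current compilers the structure itself would be padded to 24 bytes, so
the sketch packs the fields explicitly to show the 20-byte payload.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical field names; the thesis states only what the structure carries. */
typedef struct {
    int32_t owner;  /* node id holding the pivot */
    int32_t row;    /* pivot row index */
    int32_t col;    /* pivot column index */
    double  value;  /* pivot value */
} pivot_t;

/* Payload size: three 4-byte integers plus one 8-byte double. */
enum { PIVOT_WIRE_BYTES = 3 * sizeof(int32_t) + sizeof(double) };

/* Copy the fields into buf without padding; returns the number of bytes written. */
size_t pivot_pack(const pivot_t *p, unsigned char *buf)
{
    size_t off = 0;
    assert(p != NULL && buf != NULL);
    memcpy(buf + off, &p->owner, sizeof p->owner); off += sizeof p->owner;
    memcpy(buf + off, &p->row,   sizeof p->row);   off += sizeof p->row;
    memcpy(buf + off, &p->col,   sizeof p->col);   off += sizeof p->col;
    memcpy(buf + off, &p->value, sizeof p->value); off += sizeof p->value;
    return off;
}

/* Inverse of pivot_pack(): rebuild the structure from a packed buffer. */
void pivot_unpack(const unsigned char *buf, pivot_t *p)
{
    size_t off = 0;
    assert(p != NULL && buf != NULL);
    memcpy(&p->owner, buf + off, sizeof p->owner); off += sizeof p->owner;
    memcpy(&p->row,   buf + off, sizeof p->row);   off += sizeof p->row;
    memcpy(&p->col,   buf + off, sizeof p->col);   off += sizeof p->col;
    memcpy(&p->value, buf + off, sizeof p->value);
}
```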

There are other important and related ideas. The frequency and volume
of communications traffic are easy to determine with a high degree of accuracy for
algorithms such as Gauss factorization. Once again, in the presence of this kind
of information, we should dispense with vague concepts. It is useful to consider
something like a pie chart showing the amount of time spent on each portion
of the major loop in a program. Indeed, this was a part of the development of the
Gauss code given in this thesis. Tools such as these are important in refining parallel
algorithms and streamlining code.

The parallel program designer must consider many other issues regarding
communications. Graph theory notation is a natural tool. A link-by-link analysis
of the communications over the course of a program is not out of the question
(especially if the communication is merely a repetition of very simple messages).
Efficient use of the topology is important. We should consider the percentage of
links used, balancing of the communications load, frequency of traffic for each link
(often the communication comes in bursts and often between iterations of the basic
algorithm), flow rate (in bytes per second) for each link during the bursts or over
longer periods of time, timelines showing dependencies, and other specific
characteristics of communications. Analysis should be done on a per-stage basis for
algorithms that exhibit iteration (loops).

Perhaps most importantly, a plan for interprocessor communication should
begin well in advance, before the code is ever written. A reactive approach is
necessary, like debugging code. But a proactive, strong design effort can simplify
matters. The notion of communicating sequential processes (CSP) deserves attention.
This model is due to C. A. R. Hoare [Ref. 30], and it is never far away in the world
of transputers. There is a very close relationship between transputers, occam (their
native language), and CSP. CSP is a useful paradigm for this sort of (message-passing)
machine. When possible, a problem should be logically separated into processes.
The division of the problem should be natural, so that every process represents a
logical group of tasks. The processes are allowed channels to communicate, and these
channels are implemented either as links in hardware or as buffers in memory (when,
for instance, two processes on the same processor need to communicate).

If a problem is designed correctly, we should have substantial amounts of 
work within a process and minimal interprocess communication. If the processes and 
channels are represented as the nodes and edges of a directed graph, we can make 
use of some nice tools and theorems from graph theory. For instance, we should like 
to maximize computation and minimize communications. One natural method is to 
begin with atomic processes and start to build. 

Suppose that we have many such processes (at least as many as processors) 
and we represent them as the nodes of a directed graph. We can assign the processes 
(nodes) a weight that reflects some form of computational difficulty. This should be 
a fairly concrete number, assuming that the task (process) is well-defined. It might 
be the number of flops per iteration, for example. Next, the channels should be 
clearly indicated as weighted, directed edges. The weight should usually be a very 
concrete number as well, like the number of bytes that passes along that channel 
between each stage of a computation. 

This model gives the problem the sort of order that is necessary to keep
the parallel design simple, logical, and formal (i.e., friendly for proof of program
correctness). Once the problem has been expressed in such a manner, there are
many options. For example, we could consider minimum cuts of the flow rates to
decide how to efficiently apportion processes to processors. This mapping alone could
greatly enhance the performance of code.
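
A minimal sketch of this weighted-graph model follows. It does not compute a minimum
cut; it only evaluates a proposed mapping of processes to processors, reporting the
bytes per stage that cross processor boundaries and the heaviest computational load.
All names (channel_t, cut_weight(), max_load(), MAX_PROCESSORS) are hypothetical and
are introduced here for illustration only.

```c
#include <assert.h>
#include <stddef.h>

#define MAX_PROCESSORS 64   /* arbitrary bound chosen for this sketch */

/* A directed, weighted edge: one channel between two processes. */
typedef struct {
    int  from;    /* index of the sending process */
    int  to;      /* index of the receiving process */
    long bytes;   /* bytes passed along the channel per stage */
} channel_t;

/* Bytes per stage that cross processor boundaries when process i is
   placed on processor proc_of[i]. */
long cut_weight(const channel_t *ch, size_t nch, const int *proc_of)
{
    long total = 0;
    size_t i;
    for (i = 0; i < nch; i++)
        if (proc_of[ch[i].from] != proc_of[ch[i].to])
            total += ch[i].bytes;
    return total;
}

/* Largest per-processor load (flops per stage) under the same mapping. */
long max_load(const long *flops, size_t nprocesses,
              const int *proc_of, int nprocessors)
{
    long load[MAX_PROCESSORS] = {0};
    long worst = 0;
    size_t i;
    int q;
    assert(nprocessors > 0 && nprocessors <= MAX_PROCESSORS);
    for (i = 0; i < nprocesses; i++) {
        assert(proc_of[i] >= 0 && proc_of[i] < nprocessors);
        load[proc_of[i]] += flops[i];
    }
    for (q = 0; q < nprocessors; q++)
        if (load[q] > worst)
            worst = load[q];
    return worst;
}
```

Two mappings with the same max_load() can differ greatly in cut_weight(); comparing
the two numbers for candidate mappings is one simple way to use the model before
reaching for a true minimum-cut algorithm.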

It seems that much of the work in this area is rather imprecise and generally
unacceptable. Granted, parallel design methodology is a relatively recent problem,
but it can be improved substantially. Good parallel designs that consider these kinds
of issues and express them clearly will likely be in high demand as parallel computing
machinery develops.
C. PARALLEL METHODS 


The wide-ranging capabilities of contemporary computing machinery are evident.
An exhaustive list would demand pages, but most readers could readily name
several applications that bear little resemblance to each other. For a single, very
specific machine there is almost no limit to the combinations of sequential
instructions that it may carry out. Put another way, a particular machine can be
designed and built in a few months or years depending upon the level of
sophistication involved. But the different types and purposes of software that may
be created to run on that single machine are nearly limitless. Consider Householder’s
comments on the art of computation [Ref. 17: p. 1]:


If a computation requires more than a very few operations, there are usually
many different possible routines for achieving the same end result. Even so simple
a computation as ab/c can be done (ab)/c, (a/c)b, or a(b/c), not to mention the
possibility of reversing the order of the factors in the multiplication.
Mathematically these are all equivalent; computationally they are not (cf. §1.2 and
§1.4). Various, and sometimes conflicting, criteria must be applied in the final
selection of a particular routine. If the routine must be given to someone else, or
to a computing machine, it is desirable to have a routine in which the steps are
easily laid out, and this is a serious and important consideration in the use of
sequenced computing machines. Naturally one would like the routine to be as short
as possible, to be self-checking as far as possible, to give results that are at
least as accurate as may be required. And with reference to the last point, one
would like the routine to be such that it is possible to assert with confidence
(better yet, with certainty) and in advance that the results will be as accurate as
may be desired, or if an advance assessment is out of the question, as it often is,
one would hope that it can be made at least upon completion of the computation.


— ALSTON S. HOUSEHOLDER


Parallel algorithms are combinations of sequential ones, so their complexity 
can grow quickly. In general, the hardware issues surrounding parallel problems 
are mature and straightforward. Software, on the other hand, is developing and 
generally difficult to use. 

In addition to the familiar design considerations for a straightforward sequential
algorithm, the design of a parallel solution must specify:


• An awareness of the interaction between processing and communication.
  Frequency and duration (message length) of communications should be known, if
  possible. Additionally, we should know how this compares to the frequency
  and duration (flops) of computing work.
• A plan for interprocessor communication, including hardware and software.
• A scheme for memory usage.
• The granularity of the problem (i.e., should the processors be given larger or
  smaller “chunks” of work at a time).
• Load balancing among several processors.
• A method for accessing input/output resources.


This is a very high-level look at the problem. The issue of communications alone
can be more than half of the problem. The simplicity of this short list does not do
the problem justice. Correct execution, as in the sequential case, is very important.
But parallel algorithms are subject to the added scrutiny of performance data (e.g.,
efficiency).
Constructing parallel algorithms is a very creative process,
and there are many questions that can be asked. Is a highly efficient parallel solution
possible, or is the problem bound by dependencies and sequential work? What is
the ratio of time spent communicating to time spent computing? How nearly does
a given algorithm approach the optimal solution? What would happen on some
other number of processors? Are there any bottlenecks that can be eliminated?
Nevertheless, the current performance of parallel machines and the promise of
future architectures are more than adequate motivation to continue developing these
products.
D. ALGORITHMS 


With the preceding concerns in mind, let us consider the algorithm for Gauss
factorization that was used in this work. The algorithm is given at a very high
level because detail can be gleaned from Chapter V and from the actual code in
Appendix F. The first consideration for GF was “How should the work be distributed?”
There are many options. The matrix could be distributed by rows, or columns, or
blocks. The method chosen in this case was a distribution of the columns of A across
the nodes of the machine. The columns were distributed so that column j went to
processor number j (mod P) in a P-processor network.
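
The modulus assignment can be written down directly. The sketch below is
illustrative only; the function names are hypothetical and do not appear in
Appendix F. It gives the owner of a column, the number of columns a node holds, and
the number of a node's columns that are still active (column index at least r) at
stage r.

```c
#include <assert.h>

/* Node that holds column j under the modulus (cyclic) assignment. */
int owner_of_column(int j, int P)
{
    assert(j >= 0 && P > 0);
    return j % P;
}

/* Number of columns 0..n-1 held by node N. */
int local_column_count(int n, int P, int N)
{
    assert(n >= 0 && P > 0 && N >= 0 && N < P);
    return n / P + (N < n % P ? 1 : 0);
}

/* Number of columns j with r <= j < n held by node N, i.e. the columns
   that still take part in the arithmetic at stage r. */
int active_columns(int n, int P, int N, int r)
{
    assert(r >= 0 && r <= n);
    return local_column_count(n, P, N) - local_column_count(r, P, N);
}
```

Because consecutive columns go to consecutive nodes, the active-column counts of any
two nodes differ by at most one at every stage.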

Such a distribution scheme seems natural for several reasons. First, the work
associated with the Gauss process moves toward the lower right-hand corner of the
matrix $A \in \mathbb{R}^{n \times n}$. By using a modulus assignment, and assuming that n > P, we
have a situation where the load on the processors is nearly balanced for most of the
process. Second, a column-oriented assignment places the pivot column on a single
node at each stage. This makes division by the pivot value a simple task. It is
interesting to note that a similar distribution of A by rows would have merit as well.
Once the matrix has been distributed, the code simply moves, in a synchronized
fashion, from stage to stage of Gauss. At each stage, we must pivot according to
some strategy. Complete pivoting showed especially poor performance since it
involved a great deal of communication and synchronization between stages. With
partial pivoting, we know in advance which node will hold the pivot, and
much less communication is required because this node simply broadcasts the pivot and
pivot column. After the pivot node divides every element under the pivot by the
pivot value, it broadcasts the entire pivot column to every other processor. When the
processors obtain the pivot column, they use the multipliers to perform arithmetic
in the Gauss transform area, and then proceed to the next stage.

The following algorithms give an overview of the programs that appear in
Appendix F.


Algorithm 4.1 (Parallel GF: Host) At this level, the host code is essentially the
same for both partial pivoting and complete pivoting. The program is very simple:
distribute the columns, and then accept them back one by one. Let $A \in \mathbb{R}^{n \times n}$ be
the matrix of coefficients, and let P be the number of processors. This algorithm
forms the modified copy of A by overwriting the original copy. After the nth column
is returned from the nodes, we have the factored version of A that can be separated
into L and U in the usual manner.


begin GF (Host)
    for j = 0 : (n-1)
        send A(:, j) to node (j mod P)
    end for
    for r = 0 : (n-1)
        receive A(:, r) from node (r mod P)
    end for
end GF (Host)
Algorithm 4.2 (Parallel GFPP: Nodes) Let $A \in \mathbb{R}^{n \times n}$ be the entire matrix
(held at the host). This algorithm is executed on each node in a P-processor network.
Let the node number be N and let $A_N \in \mathbb{R}^{n \times m_N}$ be the local copy of select columns
of the matrix A (where $m_N \approx n/P$ is the number of columns held locally). Let $G_N$
be that part of the Gauss transform area, G, that is held locally. This node receives
every column, j, of A where (j mod P) = N.


begin GFPP (Nodes)
    for j = 0 : (m_N - 1)
        receive column and place in A_N(:, j)
    end for
    for r = 0 : (n-1)
        if (r mod P) = N (pivot is held locally)
            perform partial pivoting
            broadcast pivot row index, s, to all nodes
            perform pivot column arithmetic
            broadcast pivot column to all nodes
        else
            receive pivot row index, s, and perform row interchanges
            receive broadcast of pivot column
        end if
        if N = 0
            send pivot column to host
        end if
        perform arithmetic in G_N
    end for
end GFPP (Nodes)
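
To make the per-stage arithmetic concrete, here is a single-process sketch of Gauss
factorization with partial pivoting that follows the same stage structure: choose
the pivot in column r, interchange rows, divide the column below the pivot by the
pivot value, and update the remaining columns. It is not the code of Appendix F. The
function name gfpp_sequential(), the row-major storage, and the perm array are
assumptions made for this illustration, and the communication steps appear only as
comments.

```c
#include <assert.h>

/* Element (i, j) of an n-by-n matrix stored row by row. */
#define AT(A, n, i, j) ((A)[(i) * (n) + (j)])

double abs_value(double x)
{
    return x < 0.0 ? -x : x;
}

/* Overwrites A with its factors (multipliers below the diagonal, the upper
   triangle on and above it). perm[r] records the row interchanged with row r.
   Returns 0 on success, -1 if a zero pivot is encountered. */
int gfpp_sequential(double *A, int n, int *perm)
{
    int r, i, j;
    assert(A != 0 && perm != 0 && n > 0);
    for (r = 0; r < n; r++) {
        int s = r;
        /* Partial pivoting: done by the node that owns column r. */
        for (i = r + 1; i < n; i++)
            if (abs_value(AT(A, n, i, r)) > abs_value(AT(A, n, s, r)))
                s = i;
        perm[r] = s;
        if (AT(A, n, s, r) == 0.0)
            return -1;
        /* Row interchange: every node applies it to its own columns
           after the pivot row index s has been broadcast. */
        if (s != r) {
            for (j = 0; j < n; j++) {
                double tmp = AT(A, n, r, j);
                AT(A, n, r, j) = AT(A, n, s, j);
                AT(A, n, s, j) = tmp;
            }
        }
        /* Pivot column arithmetic: form the multipliers (owner only). */
        for (i = r + 1; i < n; i++)
            AT(A, n, i, r) /= AT(A, n, r, r);
        /* Arithmetic in the Gauss transform area: each node updates its
           own columns j > r once it has the broadcast pivot column. */
        for (j = r + 1; j < n; j++)
            for (i = r + 1; i < n; i++)
                AT(A, n, i, j) -= AT(A, n, i, r) * AT(A, n, r, j);
    }
    return 0;
}
```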
Algorithm 4.3 (Parallel GFPC: Nodes) Let $A \in \mathbb{R}^{n \times n}$ be the entire matrix
(held at the host). This algorithm is executed on each node in a P-processor network.
Let the node number be N and let $A_N \in \mathbb{R}^{n \times m_N}$ be the local copy of select columns
of the matrix A (where $m_N \approx n/P$ is the number of columns held locally). Let $G_N$
be that part of the Gauss transform area, G, that is held locally. This node receives
every column, j, of A where (j mod P) = N.


begin GFPC (Nodes)
    for j = 0 : (m_N - 1)
        receive column and place in A_N(:, j)
    end for
    for r = 0 : (n-1)
        locate best (local) pivot candidate
        elect pivot (let node N_p hold the winner of the pivot election)
        if N = N_p
            broadcast pivot indexes, (s, t), to all nodes
            perform pivot column arithmetic
            broadcast pivot column to all nodes
        else
            receive pivot indexes, (s, t)
            perform permutations
            receive broadcast of pivot column
        end if
        if N = 0
            send pivot column to host
        end if
        perform arithmetic in G_N
    end for
end GFPC (Nodes)
V. IMPLEMENTATION


A. ENVIRONMENT 


Chapter IV introduces parallel algorithms for Gauss factorization (GF). The
GF algorithms are produced for partial and complete pivoting strategies. All of
the programs associated with this research are written in parallel versions of the C
language and executed on two types of machines at the U.S. Naval Postgraduate
School. The Math Department’s iPSC/2 afforded eight of Intel’s CX type processors
arranged in a hypercube topology. The Parallel Command and Decision Systems
(PARCDS) Laboratory in the Computer Science Department has more than seventy
transputers available for the experiments. The discussion below gives a more exact
description of the material and equipment used in the work.
1. Hardware 


This section describes the machines upon which the work was carried out.
A general knowledge is assumed, including familiarity with the Intel 80386
microprocessor, 80387 math coprocessor, and INMOS transputers. Some of this
information is provided in Appendix B.

The hardware used in this research represents the state of the art for the
mid-to-late 1980s. These machines are quickly becoming outdated (fitting the
history of computing), but both INMOS and Intel have more recent, competitive
products in today’s market and fine prospects for future machines. So, while they are
a bit dated, the products used in this research represent important contemporary
parallel architectures.
Figure 5.1: Hypercube Interconnection Topology: Order n ≤ 3


a. Networks of Transputers 


The majority of the research was performed upon hypercubes of order
$n \in \{0, 1, 2, 3\}$. These are the usual hypercubes (see Appendix C) and each is
embedded in the 3-cube. Figure 5.1 shows this topology. Some of the transputer
work for this thesis was performed by a network of sixteen IMS T800-20 transputers
connected in nearly hypercube fashion (Figure 5.2). This is not identical to the
4-cube, so it will be called the hybrid cube (it is used as a root with two subtrees that
happen to be 3-cubes). The subtrees of the hybrid cube can be distinguished by the
first bit. One of the 3-cubes has labels like 0xxx; the other is labeled 1xxx.
Figure 5.2: Hybrid Hypercube Interconnection Topology


The rationale behind building the hybrid cube is purely practical. The
transputers have only four links. Assuming that we define each node of the hypercube
to be a single transputer, a pure hypercube of order four would be a closed
interconnection scheme with no opportunity for input or output to or from the system.
Here, the root node has been inserted between nodes zero (0000) and eight (1000).
While this deals a horrible blow to the elegance of hypercube algorithms (particularly
communications), it can be used effectively.

The hardware for the hybrid hypercube is configured with code by Mike
Esposito [Ref. 31]. This gives us a sort of unlabeled version of the structure that
appears in Figure 5.2. To make use of this configuration, the nodes must be labeled
in a logical fashion. The Gray code (Appendix C) is a reasonable choice for labeling
the nodes. The actual labeling is accomplished by a Network Information File (NIF)
when the transputers are loaded by the Logical Systems C Network Loader, LD-NET.
A more detailed description of this process is contained in the file named
hyprcube.nif in Appendix F.
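
For readers who want to experiment with Gray-code labels, the standard binary
reflected Gray code can be generated with one line of C. This is a generic sketch
and not a reproduction of the labeling in the NIF; the function names are
hypothetical.

```c
#include <assert.h>

/* Binary reflected Gray code of i: successive values differ in one bit. */
unsigned gray_code(unsigned i)
{
    return i ^ (i >> 1);
}

/* Number of one bits in x (used to check the one-bit-difference property). */
int bits_set(unsigned x)
{
    int count = 0;
    while (x != 0) {
        count += (int)(x & 1u);
        x >>= 1;
    }
    return count;
}
```

Walking i from 0 to 7 yields the labels 000, 001, 011, 010, 110, 111, 101, 100, so
each step (including the step from the last label back to the first) moves along a
single edge of the 3-cube.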

Networks of transputers use point-to-point communications across
bidirectional links. The links for this work operate at 20 megabits per second
(bidirectionally). That is, ten megabits per second is a peak unidirectional
transmission rate. Current transputer implementations employ a store-and-forward
approach to message passing (see Appendix B) for multi-hop transmissions.


b. Intel iPSC/2 


The iPSC/2 used for this research contained eight processors of the
“CX” type (80386/80387 combination). The host is an 80386-based IBM-compatible
personal computer running AT&T UNIX System V (version 3.2). The nodes run a
local subset of UNIX called NX. The host is capable of supporting many users at
once, but each node supports only a single user.

Users can request p nodes, where $p = 2^n$ for $n \in \{0, 1, 2, 3\}$. If another
user does not already have the requested portion of the cube, the request is granted.
As long as nodes remain, another user can access them. For instance, one user could
be working on two nodes and, at the same time, another user could access up to
four others. While the first two users still possessed these six nodes, a third user
could get one or both of the remaining two nodes.

Unlike the transputers, Intel uses a direct-connect circuit switching (see
Appendix B) approach to multi-hop communications. There is an overhead
associated with setting up the path for communication, but this cost is nearly the same
regardless of how many hops the message crosses. Once the circuit is established,
the message can proceed directly from the origin to the destination with negligible
interference from intermediate nodes.
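
The contrast between the two approaches can be expressed with a very rough
first-order cost model. The model below is an assumption introduced for
illustration; it is not taken from the measurements in this thesis. It charges a
store-and-forward message a startup cost and a full transmission on every hop, and
charges a circuit-switched message one setup cost and one transmission regardless
of hop count. The function and parameter names are hypothetical.

```c
#include <assert.h>

/* Store-and-forward: each hop pays a startup cost plus a full transmission. */
double store_and_forward_time(int hops, double bytes,
                              double startup, double bytes_per_sec)
{
    assert(hops >= 1 && bytes >= 0.0 && bytes_per_sec > 0.0);
    return hops * (startup + bytes / bytes_per_sec);
}

/* Circuit switching: one setup cost, then a single end-to-end transmission.
   The dependence on hop count is neglected in this first-order model. */
double circuit_switched_time(int hops, double bytes,
                             double setup, double bytes_per_sec)
{
    assert(hops >= 1 && bytes >= 0.0 && bytes_per_sec > 0.0);
    (void)hops;
    return setup + bytes / bytes_per_sec;
}
```

Under these assumptions the two costs agree for a single hop and diverge linearly
with distance, which is the qualitative behavior described above.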


c. Host and Root 


The notion of host is similar on both machines, but there is a slight
difference. The Intel hypercube is directly connected to the host. The transputer
network, however, uses a substantially different protocol than the typical personal
computer. Transputers employ point-to-point serial communications, using an 11-bit
link protocol with byte-by-byte acknowledgment. The acknowledge is a two-bit
packet with dual meaning: the receiving transputer has begun to receive the byte,
and it has storage space for another.

In the transputer case, host means the PC. We use the term root transputer
to identify the transputer within the host PC that acts something like a host
to the attached network of transputers. Figure 5.1 illustrates this configuration. An
IMS B004 extension board in the host PC holds a T414 root transputer. The B004
is plugged into the PC’s bus and a parallel-serial converter lies between the PC and
the T414. In Figure 5.1 the “host” is a PC and the “root” transputer is the T414.
The iPSC/2 host is simplified, and could almost be thought of as a combination of
the host and root for the transputer case. Since the entire thesis uses the same
programs for both machines, the root and host terminology can become confusing. As it
is not always convenient to express this difference in painstaking detail, I will use the
terms somewhat loosely. An understanding of the differences between the machines
should serve to eliminate confusion in every case. When only one of the terms (host
or root) is needed, I have used the correct term. When both of the terms apply, I
have used them almost interchangeably and they should be interpreted according to
the machine under consideration.
2. Software 


The software for this research was written in the C language. The Logical
Systems C product (version 89.1 of 15 January 1990) was used for the transputer
implementation. For the iPSC/2 work, the C compiler supplied by Intel was used.
B. COMMUNICATIONS FUNCTIONS 


Prior to implementing the Gauss algorithms, a substantial communications
package was constructed. Most of the code for communications appears in the files
comm.h and comm.c (see Appendix F). As expected, the header file provides
definitions for manifest constants and specifications (declarations) for the functions.
An overview of the functions provided in these files is useful before we discuss the
Gauss code that calls them.

The cubecast() function supports broadcasts from the host to all the nodes
of a hypercube. Given a hypercube of order $n \in \{0, 1, 2, 3\}$ with $p = 2^n$ processors,
this communication is completed in n, or $\log_2(p)$, stages. This has some utility
in a 3-cube, but imagine the impact in a 10-cube. All 1,024 processors in the
hypercube would have the message after 10 stages of communication. This function
is especially useful at the beginning of a problem, when data must be shipped to
each of the workers in the network.
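
The stage count can be checked with a small simulation. The sketch below is not the
cubecast() code from comm.c; it assumes that the message enters the cube at node 0
and that, at each stage, every node already holding the message forwards it across
one new dimension. The function name cubecast_stages() and the bound MAX_ORDER are
hypothetical.

```c
#include <assert.h>

#define MAX_ORDER 10   /* 2^10 = 1,024 nodes, the largest case mentioned above */

/* Simulates a one-to-all broadcast from node 0 in a hypercube of order n.
   stage_of[k] receives the stage at which node k first holds the message
   (0 for the source). Returns the number of stages used. */
int cubecast_stages(int n, int *stage_of)
{
    int p, node, stage;
    assert(n >= 0 && n <= MAX_ORDER && stage_of != 0);
    p = 1 << n;
    for (node = 0; node < p; node++)
        stage_of[node] = -1;
    stage_of[0] = 0;
    for (stage = 1; stage <= n; stage++) {
        int d = 1 << (stage - 1);      /* dimension crossed at this stage */
        /* Nodes 0 .. d-1 already hold the message; each sends across d. */
        for (node = 0; node < d; node++)
            stage_of[node | d] = stage;
    }
    return stage - 1;
}
```

The number of nodes holding the message doubles at every stage, which is why p
nodes are reached in $\log_2(p)$ stages.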

Often we need to gather information in the reverse direction, from the workers 
back to the root. The coalesce() function is one way to accomplish this task. If no 
modification was necessary at intermediate nodes, this operation could be completed 
without interference. In the algorithms that I used, however, there was occasion to 
modify the information along the way back to the root. For this reason, the gathering 
is accomplished using two function calls. First, information is coalesced to a given 
node. Upon return from coalesce(), the data exists locally and may be operated 
upon. When the data is ready for submission, the submit() function is used to pass 
it one step closer to the root. 

A modification of the cubecast() function that was useful for the Gauss
problem was cubecast_from(). This function does not assume that the host is the
originator of the broadcast. Instead, the source is specified as the first argument to
this function. The function still performs the broadcast in $\log_2(p)$ stages, but it uses
the concept of a direction to accomplish this.

The concept of directions in the hypercube turns out to be a fairly useful
one. For concreteness, consider the 3-cube shown in Figure C.2. Starting at
any given node, we can specify a direction using one of the three combinations
$d \in \{001, 010, 100\}$. Suppose that the node’s label is $\ell$ and let $\oplus$ denote the
exclusive OR operation. Then for some direction, d, the number $(\ell \oplus d)$ is the label
of the node in the direction d from the node $\ell$.

This concept can be applied in general in a hypercube of order n using n-bit
labels for the nodes and some direction d. The possible directions are all the n
combinations of (n - 1) zeros and a single one in an n-bit number. Accordingly,
the code uses directions $d \in \{1, 2, 4, \ldots, 2^{n-1}\}$. In most cases, when a
direction-by-direction approach is desired for all possible directions, we start with
one and use the C left shift operator (<<) to produce the other directions incrementally.

These functions and several others are described in detail in the code of
Appendix F, but these basic ideas give us a reasonably good introduction at a level
that is adequate for understanding the algorithms.
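
The direction idea translates into C almost literally. The sketch below is an
illustration and not the cubecast_from() code of Appendix F: the names neighbor()
and cubecast_from_stages() are hypothetical, and the simulation assumes that a node
forwards the message in direction d once its label, taken relative to the source
(label XOR source), is smaller than d.

```c
#include <assert.h>

#define MAX_ORDER 10   /* bound chosen for this sketch */

/* Label of the node reached from 'label' by moving in direction d,
   where d has exactly one bit set. */
int neighbor(int label, int d)
{
    return label ^ d;
}

/* Simulates a broadcast from node src in a hypercube of order n.
   stage_of[k] receives the stage at which node k first holds the message.
   Returns the number of stages used. */
int cubecast_from_stages(int src, int n, int *stage_of)
{
    int p, d, rel, node, stage = 0;
    assert(n >= 0 && n <= MAX_ORDER && stage_of != 0);
    p = 1 << n;
    assert(src >= 0 && src < p);
    for (node = 0; node < p; node++)
        stage_of[node] = -1;
    stage_of[src] = 0;
    /* Start with direction one and shift left to get the others. */
    for (d = 1; d < p; d <<= 1) {
        stage++;
        /* Nodes whose relative label is below d already hold the message. */
        for (rel = 0; rel < d; rel++) {
            int sender = rel ^ src;
            stage_of[neighbor(sender, d)] = stage;
        }
    }
    return stage;
}
```

With src = 0 this reduces to an ordinary broadcast from node 0; any other source
simply relabels the cube by an exclusive OR with src.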
C. CODE DESCRIPTIONS 


A detailed description of the source code used to implement the algorithms of
Chapter IV is given in the header file gf.h. This header file, located in Appendix F, is
used by both the partial pivoting and complete pivoting codes. The code for GF with
partial pivoting can be found in gfpphost.c, the host program, and gfppnode.c,
the node program. The code for the complete pivoting algorithm is similar except
for the election of pivots, so most of it has been omitted in the interest of saving
space. Only the elect_next_pivot() function remains because it is the significant
difference between the partial and complete pivoting codes. This function appears
in gfpcnode.c.
VI. RESULTS


A. GAUSS WITH COMPLETE PIVOTING 


The host code, gfpchost.c, and the node program, gfpcnode.c, are written
to provide a parallel implementation of Gauss Factorization with complete pivoting.
Since the columns of A are distributed among the nodes of the multiprocessor system,
the selection of each pivot requires communication. The selection process, in this
case, begins with each node selecting its own best candidate for pivot. Once each
of the nodes has made this choice, an election is held to select the best candidate
among all of the nodes.

Implementation details for the election process are described in the source code,
so a detailed description is not given here. Nevertheless, these results show how
communication (like the election process) can stand in the way of efficient parallel
programming. This program shows how parallel performance can suffer from the effects
of communications. (Recall Fox’s $t_{comm}/t_{calc}$ and Seitz’s three components of
overhead from Chapter IV.)

The complete pivoting strategy inserts inefficient communications between each
stage of the process. The communications themselves are bound to be inefficient since
the election process finds all nodes of an n-cube participating in an n-stage exchange
of a 20-byte structure (pivot candidates). In addition to the use of small messages,
the election imposes an added measure of synchronization upon the problem. This
allows the processors less independence and forces them to transition between
“useful” program execution and communication more frequently. This transition can
become burdensome and the processor can eventually find little time to perform
calculations.
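
One way to picture an n-stage election is as a dimension-by-dimension exchange:
at each stage every node swaps candidates with its neighbor in one direction and
keeps the better of the two. The simulation below is a sketch under that assumption
and may differ from elect_next_pivot() in gfpcnode.c. The type candidate_t, the
tie-breaking rule (lower node id wins), and the function names are hypothetical.

```c
#include <assert.h>

#define MAX_NODES 64   /* bound chosen for this sketch */

/* A pivot candidate: who holds it, where it is, and its value. */
typedef struct {
    int    owner;
    int    row;
    int    col;
    double value;
} candidate_t;

double magnitude(double x)
{
    return x < 0.0 ? -x : x;
}

/* Nonzero if a should win over b: larger magnitude first, then lower owner id. */
int beats(const candidate_t *a, const candidate_t *b)
{
    double ma = magnitude(a->value);
    double mb = magnitude(b->value);
    if (ma != mb)
        return ma > mb;
    return a->owner < b->owner;
}

/* cand[k] is node k's local candidate on entry and the elected pivot on exit.
   Returns the number of exchange stages (n for an n-cube). */
int elect_pivot(candidate_t *cand, int n)
{
    candidate_t next[MAX_NODES];
    int p, d, node, stages = 0;
    assert(cand != 0 && n >= 0 && (1 << n) <= MAX_NODES);
    p = 1 << n;
    for (d = 1; d < p; d <<= 1) {
        /* Every node exchanges with its neighbor in direction d. */
        for (node = 0; node < p; node++) {
            int partner = node ^ d;
            next[node] = beats(&cand[partner], &cand[node]) ? cand[partner]
                                                           : cand[node];
        }
        for (node = 0; node < p; node++)
            cand[node] = next[node];
        stages++;
    }
    return stages;
}
```

After the last stage every node holds the same winner, so no further broadcast of
the election result is needed in this scheme. Each stage moves only one small
structure per node, which illustrates why the exchange is dominated by message
startup cost rather than transmission time.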
In addition to the election process, there is a one-to-all broadcast from the
node holding the pivot to inform the others of the pivot column values. With an
m × m matrix A, this message is essentially a column of m double precision
floating-point values. Doubles for this implementation were eight bytes each, so this is a
unidirectional broadcast of 8m bytes with exponential fanout.

The election process, as simple as it appears, will prove to be an obstacle
that opposes efficiency. Both the iPSC/2 and transputer systems reward, in terms
of transmission rates, the sender of long messages. Short messages are essentially
penalized by the overhead involved in setting up the transmission line and manager.
Let us consider the results of this complete pivoting strategy. The results from the
iPSC/2 appear first, followed by the transputer results. The largest dimension, n,
that is recorded is n = 176. The iPSC/2 machine would handle larger problems, but
this seemed pointless since the performance appears to approach maximum efficiency
early.
1. Data for the iPSC/2 System 


Table 6.1 shows the timing data for execution of Gauss Factorization with
complete pivoting on the Intel iPSC/2 system.

TABLE 6.1: EXECUTION TIMES FOR GF(PC) ON THE iPSC/2

Dimension n    Time (seconds) on a Hypercube of Order

36.123
49.227
The speedup data that is shown in Table 6.2 is derived from these execution times.
Speedup was calculated using the usual formula (see Appendix A for details)

$$S_p = \frac{t_1}{t_p}$$

for speedup on p processors.
TABLE 6.3: EFFICIENCIES FOR GF(PC) ON THE iPSC/2

Dimension n    Efficiency on a Hypercube of Order

Given the execution times and speedups presented in Tables 6.1 and 6.2, and using
the formula

$$E_p = \frac{S_p}{p}$$

(as defined in Appendix A), we can determine the efficiency of p processors applied
to the Gauss problem. This efficiency data is shown in Table 6.3.
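
For reference, the two derived quantities amount to the following arithmetic. The
function names are hypothetical, and the sketch assumes that $t_1$ is the measured
time on one node and $t_p$ the measured time on p nodes for the same problem.

```c
#include <assert.h>

/* Speedup on p processors: time on one node divided by time on p nodes. */
double speedup(double t1, double tp)
{
    assert(t1 > 0.0 && tp > 0.0);
    return t1 / tp;
}

/* Efficiency: speedup divided by the number of processors (1.0 is ideal). */
double efficiency(double t1, double tp, int p)
{
    assert(p > 0);
    return speedup(t1, tp) / p;
}
```

For example, a run that takes 8 seconds on one node and 2 seconds on eight nodes
has a speedup of 4 and an efficiency of 0.5 (these times are made up for the
example and are not taken from the tables).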
Figure 6.1: Efficiencies for GF (PC) on the iPSC/2


Many different graphical displays of this data would be interesting, but the efficiency
data may be the most interesting since it captures the success or failure of a
parallel program (i.e., poor efficiencies should lead us to question the parallel nature
of the algorithm). Figure 6.1 shows a scatterplot of the data from Table 6.3.
TABLE 6.4: EXECUTION TIMES FOR GF(PC) ON THE TRANSPUTERS

6.7087
13.6538

2. Data for the Transputer System


Using the same methods, the timing (Table 6.4), speedup (Table 6.5), and
efficiency (Table 6.6) data for the transputer system is determined. Unfortunately,
the memory limitations of the transputers used for this work prevented comparisons
for large problem sizes. Empty portions of Table 6.4 signify unavailability of data
(i.e., execution failure due to inappropriate or excessive problem size). The maximum
problem size that executed successfully for each configuration is listed on the last
line of Table 6.4. Figure 6.2 shows a scatterplot of the data from Table 6.6.
TABLE 6.5: SPEEDUPS FOR GF(PC) ON THE TRANSPUTERS

n    Speedup on a Hypercube of Order

TABLE 6.6: EFFICIENCIES FOR GF(PC) ON THE TRANSPUTERS

Figure 6.2: Efficiencies for GF (PC) on Transputers

B. GAUSS WITH PARTIAL PIVOTING


1. Data for the iPSC/2 System 


Table 6.7 shows the timing data for execution of the Gauss Factorization
(partial pivoting) codes (gfpphost.c and gfppnode.c) on the Intel iPSC/2 system.
The speedup data shown in Table 6.8 are derived from these execution times.
Speedup was calculated using the usual formula (see Appendix A for details)

$$S_p = \frac{T_1}{T_p}$$

for speedup on p processors. Given the execution times and speedups presented in
Tables 6.7 and 6.8, and using the formula

$$E_p = \frac{S_p}{p}$$

(as defined in Appendix A), we can determine the effectiveness (efficiency) of p
processors applied to the Gauss problem. These efficiency data are shown in Table 6.9.
TABLE 6.7: EXECUTION TIMES FOR GF(PP) ON THE iPSC/2

Time (seconds) on a Hypercube of Order

0.109 0.130
16
31.204
35.865
459.740
900.536
653.070
167.616

TABLE 6.9: EFFICIENCIES FOR GF(PP) ON THE iPSC/2
Figure 6.3: Efficiencies for GF (PP) on the iPSC/2


Here, again, only the efficiency is plotted. Figure 6.3 shows a scatterplot of the data
from Table 6.9.


2. Data for the Transputer System 


Using the same methods, the timing (Table 6.10), speedup (Table 6.11), and
efficiency (Table 6.12) data for the transputer system are determined. Unfortunately,
the memory limitations of the transputers (32 kilobytes per node) used for this
work prevented comparisons for large (interesting) problem sizes. Empty portions of
Table 6.10 signify unavailability of data (i.e., execution failure due to inappropriate
or excessive problem size). The maximum problem size that executed successfully
for each configuration is listed on the last line of Table 6.10. The minimum problem
size for the hybrid cube on 16 processors was one where the dimension of A was


nn =a, 
TABLE 6.10: EXECUTION TIMES FOR GF(PP) ON THE TRANSPUTERS


Time (seconds) on a Hypercube of Order
(n)


2.3606 
PANO 1 
2.9546 
3.2910 
3.6606 
TABLE 6.11: SPEEDUPS FOR GF(PP) ON THE TRANSPUTERS

0.997 

Ole 

1.083 

1.184 

1.333 

1.493 

1.669 

1.818 

2.046 

2.208 


2.039 
2.667 
2.8093 
ZS 
Seclg 
TABLE 6.12: EFFICIENCIES FOR GF(PP) ON THE TRANSPUTERS


(n)


Efficiency (percent) on a Hypercube of Order

12.459 

ee Fass 

13.530

14.805 

16.667 

18.666 

20.859 

22a 

MS) 

28.220 

31.744 

33.343 








35.657 
37.475 
40.241 
Figure 6.4: Efficiencies for GF (PP) on Transputers


Figure 6.4 shows a scatterplot of the data from Table 6.12.


VII. CONCLUSIONS


I value the discovery of a single even insignificant truth more highly than all 
the argumentation on the highest questions which fails to reach a truth. 


— GALILEO (1564-1642) 


A. SIGNIFICANCE OF THE RESULTS 


1. Communications and Computation 


Perhaps the most obvious effect in the results of Chapter VI is the abysmal
performance of the complete pivoting code when compared to the partial pivoting
implementation. The relatively small amount of extra communication required for
the complete pivoting algorithm seems to force synchronization delays, thus reducing
the system’s performance. This demonstrates how critical it is to balance
communication with calculation in parallel processing. The conclusion, for this
problem, is that parallel designs must minimize the frequency of synchronizing
events and minimize the communication volume on occasions when communication
is necessary. The greater the amount of uninterrupted work that a processor can
accomplish, the better. While control (i.e., blocking communications,
synchronization, and loop-by-loop data distribution) is necessary, it has an adverse
impact on performance. The individual processors of a multiprocessor system should
be granted the maximum degree of independence that the mission will allow.

While there is undoubtedly some room for improvement in the complete
pivoting code, it would appear that maximum efficiencies of approximately 22%,
40%, and 70% for hypercubes of order three, two, and one, respectively, are likely on
the iPSC/2. The same code seems to be headed for somewhat better performance
on the transputers, but with the shortage of memory, it is difficult to extrapolate
and determine the direction of the plots. The higher order cubes appear to flatten
at about the same efficiency that the iPSC/2 showed as a terminal efficiency.

The partial pivoting code, on the other hand, exhibits the kind of characteristics
that we like to see in parallel code. Both systems show efficiencies rising
sharply (again, the size limit for the transputers is unfortunate), and the iPSC/2
shows very good results as the dimension of the matrix exceeds about 250.


B. THE TERAFLOP RACE 


One of the biggest challenges to parallel computing today can be found in the
“teraflop race”. There are at least three competitors with teraflop initiatives: the
United States, Europe, and Japan. The United States effort centers around Intel
with projects like Touchstone (Chapter I). The European effort relies on the T9000
transputer. Considering the three to five year old technology used for this research,
together with the numbers that the various parallel computer designers boast today,
it seems that we might see teraflop performance by the mid-1990s. C. Gordon Bell
claims that the teraflop is conceivable [Ref. 6: p. 1099]:


Two relatively simple and sure paths exist for building a system that could
deliver on the order of 1 teraflop by 1995. They are: (1) A 4K node multicomputer
with 800 gigaflops peak or a 32K node multicomputer with 1.5 teraflops. (2) A
Connection Machine with more than one teraflop and several million processing
elements.

Current products suggest that INMOS and Intel will be among the most likely
competitors. Table 7.1, adapted from Jack Dongarra’s report [Ref. 8: p. 20], shows
how transputer-based systems compare to Intel products. This Table summarizes a
test involving the solution for a 1000 x 1000 system of linear equations. The
processors used for my thesis show floating-point capabilities of 0.37 Mflops (T800-20)
and 0.16 Mflops (Compaq 386/20 with 80387) in Dongarra’s report [Ref. 8: pp. 14, 16].
TABLE 7.1: PARALLEL MACHINE COMPARISON


Computer          T_1    T_p    p    Speedup    Efficiency

Parsytec FT-400
Parsytec FT-400
Parsytec FT-400
Parsytec FT-400
Parsytec FT-400
Intel iPSC/860
Intel iPSC/860
Intel iPSC/860

The iPSC/860 illustrates the most recent technology and shows excellent uniprocessor
performance (6.5 Mflops) [Ref. 8: p. 9]. The T800 transputer that Parsytec
used is somewhat dated and will soon be replaced by the T9000. Nevertheless, the
transputer-based system shows good parallel performance. The times of execution in
the experiments of this thesis also indicate that the T800 is faster for floating-point
calculations than the 386/387 combination in the iPSC/2.


C. FURTHER WORK 


My research suggests many areas for further investigation. The method of 
conjugate gradients shows a great deal of promise as a candidate for parallelization. 
Indeed, it was the original aim of this thesis, but the development of other portions of 
the code required a great deal of time. The parallel CG algorithm should be relatively 
simple to code and holds great potential with respect to performance. Additionally, 
it possesses a nontrivial derivation and the theory behind the algorithm would be 
interesting to develop. 

There are many other variations on Gauss factorization that could be coded
and tested. While the programs presented in this thesis were designed to perform
efficiently, there is undoubtedly much that might be done to enhance this code.
Among the options: at a very basic level, we could begin with
other distributions of the matrix A. A block method or row method may actually
yield better performance. As the LINPACK benchmarks seem to use blocks, this is
probably worth pursuing.

General purpose parallel computing still requires much work. By this we mean
the ability to rely on parallel architectures for general purpose computation without
the investigator having to be more concerned with the architecture than with the
problem being computed. The ability to use parallel architectures simply as a
computational tool to solve problems will mark an increasing maturity in this field.

Applying object-oriented design and programming paradigms to the parallel 
world may hold a great deal of promise. In particular, the C++ language seems to 
be a prudent choice for parallel programming. 

In addition to the more practical options, the study of parallel theory and al- 
gorithms seems interesting and shows a great need for development. In particular, 
this field seems to need a more-or-less general (at least for MIMD machines) ap- 
proach to classifying parallel algorithms and specifying their performance. As noted 
in Chapter IV, a mixture of this field with graph theory may hold a great deal of 
promise. 

At first glance, the Ada programming language, with its built-in
tasking constructs, might seem ideal for the type of computing investigated in
this thesis. In this regard, however, Ada is optimized for use with shared memory
multiprocessors. The use of Ada on transputers still requires much experimentation
and better tools. Presently only one, rather expensive, Ada compiler is available for
transputer use. Its required use of occam harnesses makes using Ada on transputers
awkward at best. Further research is needed to create a better environment for Ada
programming on transputers. Given the significance of Ada to the DoD establishment,
this should become a priority. The inclusion of a standard math package and
the advent of Ada 9X may hold some promise in this regard.
APPENDIX A
NOTATION AND TERMINOLOGY


This appendix explains the shorthand used in the rest of the thesis. Conventions,
by definition, are generally accepted rules of the business. This would
seem to obviate the need for further discussion of conventions, but there are several
good reasons for discussing notation and terminology. First, the notation may
not be conventional. In the absence of convention (or when the foundation that it
provides is inadequate) a more substantial agreement is required. Second, even for
conventional notation, the audience may be diverse enough to warrant familiarization.
The following discussion provides this familiarity and gives the terms of an
agreement to establish the meaning of the words and symbols used in the rest of
the work. On occasion, neither convention nor this agreement will suffice. These
situations will be handled case-by-case with the philosophy that clarity should
never be sacrificed for brevity.


A. BASICS 


Most of the work deals with the integers, $\mathbb{Z}$ (from the German word for numbers,
Zahlen), the set of real numbers, $\mathbb{R}$, and the complex numbers, $\mathbb{C}$. Often, the
German $\Re$ is used to represent the reals. A complex number is a number, $x + iy =
z \in \mathbb{C}$, that has a real part ($x \in \mathbb{R}$) and an imaginary part ($y \in \mathbb{R}$), with the complex
unit $i = \sqrt{-1}$. Sometimes the real part is denoted Re(z) and Im(z) is used to
represent the imaginary part.

A scalar is simply a real number, and is usually denoted by a lower-case Greek
letter.¹ A vector is an ordered set of scalars. Lower-case Latin letters like b, x, and
y are used to denote vectors. Sometimes an arrow is placed above the name of a
vector, like $\vec{x}$, to emphasize the fact that it is a vector.


¹ The Greek alphabet is shown in the Table of Symbols.


Matrices are two dimensional and usually contain real or complex elements. 
Capital letters (Greek or Latin) are used to represent matrices. Common examples 
include A, P, Q, #, A, and &. 

The number systems introduced above cannot be represented in a finite space. 
There are two basic problems. First, we should consider the size (or cardinality) of 
the sets. The integers are countable or denumerable since there exists a one-to-one 
mapping between Z and the natural numbers, N. This is an advantage in finite 
storage since it means that we can choose a finite range of the integers and be quite 
certain that every integer in that range is represented (exactly). Even though Z is 
denumerable, it is a set with infinite cardinality. 

The real numbers present a more difficult situation for finite storage. The real
number line is dense in comparison to the integers. $\mathbb{R}$ is not only an infinite set, it is
not countable (i.e., $\mathbb{R}$ is uncountable). It is said to have the power of the continuum.
To represent a real number, x, we use the floating-point approximation, fl(x), to x.
This is a number that may be described by three parts: the sign s, the exponent e,
and the mantissa d. An illustration of such a number is provided in Chapter II.


B. COMPLEX NUMBERS 


1. Notation 


The previous section introduced one notation for complex numbers; namely,
$z = x + iy$. There are several other representations, each of which makes its own
contribution in practical use. Electrical engineers usually replace the i with j since i
is used to represent electrical current. Since the complex number can be represented
by an ordered pair of real numbers, the graphical notation of Figure A.1 is natural.
In this plane, the real and imaginary axes are used to represent the components of
a complex number.





Figure A.1: The Complex Plane 


The vector sum of these two parts, $z = \vec{x} + \vec{y}$, is an equivalent and useful
way to model complex numbers. There is yet another way to describe z. Let r be the
magnitude of the vector z and let $\theta$ be the angle measured from the positive real axis
counter-clockwise to z. Using this notation, we could use trigonometry to describe
the complex number as $z = r(\cos\theta + i\sin\theta)$. The Euler formula [Ref. 32: p. 74],

$$e^{z} = e^{x+iy} = e^{x}e^{iy} = e^{x}(\cos y + i\sin y), \tag{A.1}$$

can be used to convert a complex number to yet another form: $z = re^{i\theta}$.


2. Operations 


a. Addition and Subtraction 


Addition and subtraction of complex numbers are performed in the same
manner that vectors are added or subtracted. For instance, let $z_1 = a + ib$ and let
$z_2 = c - id$. Then the sum, $z_1 + z_2$, is the same as the sum of the corresponding
vectors:

$$\begin{bmatrix} a \\ b \end{bmatrix} + \begin{bmatrix} c \\ -d \end{bmatrix} = \begin{bmatrix} a + c \\ b - d \end{bmatrix} \tag{A.2}$$

so the sum is $z_1 + z_2 = (a + c) + i(b - d)$. Differences are handled in the obvious way,
as vector differences.


b. Multiplication


Multiplication is performed by applying high school algebra. For the
same complex numbers $z_1$ and $z_2$:

$$z_1 \times z_2 = (a + ib)(c - id) = ac - (a)(id) + (ib)(c) - (ib)(id) \tag{A.3}$$

and using the definition of the complex unit, $i = \sqrt{-1}$, we may combine the middle
terms and move the $i^2 = -1$ outside the last term to find the (complex) product:

$$z_1 \times z_2 = ac - i(ad - bc) + bd = (ac + bd) - i(ad - bc) \tag{A.4}$$


c. Conjugation 
The complex conjugate of a complex number $z = x + iy$ is defined as
$\bar{z} = x - iy$. This simple operation finds practical application in complex division.
d. Division 


Consider the quotient ($z_1/z_2$) of the same complex numbers that were
used in equations A.2, A.3, and A.4. If we multiply both the numerator and the
denominator by the complex conjugate of the denominator, $\bar{z}_2$, we have:

$$\frac{z_1}{z_2} = \frac{a + ib}{c - id} = \frac{(a + ib)(c + id)}{(c - id)(c + id)} = \frac{ac + i(ad) + i(bc) + i^2(bd)}{c^2 + d^2} \tag{A.5}$$

and then, by applying $i^2 = -1$, we conclude:

$$\frac{z_1}{z_2} = \frac{ac - bd + i(bc + ad)}{c^2 + d^2} = \frac{(ac - bd)}{(c^2 + d^2)} + i\,\frac{(bc + ad)}{(c^2 + d^2)} \tag{A.6}$$

As a practical matter, this is not the way we would compute a complex quotient.
The code given in Appendix F (function ediv() in complex.h) provides a method
that is better suited to the finite precision environment.
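To illustrate why (A.6) is avoided in finite precision: the denominator $c^2 + d^2$ can overflow or underflow even when the quotient itself is perfectly representable. One widely used remedy is Smith's method, which first divides by the larger component of the denominator. The sketch below is only an illustration of that idea and is not the ediv() routine of Appendix F; the type name cplx and the function name cplx_div are hypothetical, and the divisor is passed in the form c + id (so the $z_2 = c - id$ of the text has imaginary part -d).

```c
#include <math.h>

/* Hypothetical helper type for this sketch; not the thesis's complex.h. */
typedef struct { double re, im; } cplx;

/* Quotient num/den by Smith's method: scale by the larger component of
   the denominator so that den.re^2 + den.im^2 is never formed.
   A zero denominator is not handled and yields NaN or infinity. */
cplx cplx_div(cplx num, cplx den)
{
    cplx q;
    double r, t;

    if (fabs(den.re) >= fabs(den.im)) {
        r = den.im / den.re;            /* |r| <= 1 */
        t = den.re + den.im * r;
        q.re = (num.re + num.im * r) / t;
        q.im = (num.im - num.re * r) / t;
    } else {
        r = den.re / den.im;            /* |r| < 1 */
        t = den.re * r + den.im;
        q.re = (num.re * r + num.im) / t;
        q.im = (num.im * r - num.re) / t;
    }
    return q;
}
```

With a = 1, b = 2, c = 3, d = 4 in the notation of (A.6), the divisor is 3 - i4 and both the formula and the sketch give -0.2 + i0.4. The benefit appears with extreme magnitudes: dividing 1e200 + i1e200 by itself returns 1 here, while the squared denominator of (A.6) would overflow in double precision.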


C. VECTORS AND MATRICES 


1. Columns and Rows 


Vectors are ordered collections of scalars represented as columns. Let
$\alpha, \beta, \gamma \in \mathbb{C}$ with $\alpha = 1.0 + i4.0$, $\beta = 2.0 - i5.0$, and $\gamma = 3.0 + i6.0$. Then:

$$x = \begin{bmatrix} \alpha \\ \beta \\ \gamma \end{bmatrix} = \begin{bmatrix} 1.0 + i4.0 \\ 2.0 - i5.0 \\ 3.0 + i6.0 \end{bmatrix}$$

If row-orientation is intended the transpose is used:

$$x^T = \begin{bmatrix} \alpha & \beta & \gamma \end{bmatrix} = \begin{bmatrix} (1.0 + i4.0) & (2.0 - i5.0) & (3.0 + i6.0) \end{bmatrix}$$

Matrices may be formed as ordered combinations of elements, vectors, or blocks.
Suppose that $\mu = 3.0$ and $\nu = 7.0$. Then, with x as given above, the following
matrices are equivalent:

$$A = \begin{bmatrix} x & \mu x & \nu x \end{bmatrix} = \begin{bmatrix} 1.0 + i4.0 & 3.0 + i12.0 & 7.0 + i28.0 \\ 2.0 - i5.0 & 6.0 - i15.0 & 14.0 - i35.0 \\ 3.0 + i6.0 & 9.0 + i18.0 & 21.0 + i42.0 \end{bmatrix} \tag{A.7}$$

An element within a matrix is usually denoted A(i,j), where i is the row index and
j is the column index. For instance, A(1,3) = 7.0 + i28.0 in (A.7).
A block of the matrix A is a rectangular matrix B within A. MATLAB
notation is useful. For instance, B = A(i:j, k:l) means that B is the block of A’s
rows i through j and columns k through l. The row or column ‘:’ means all rows or
all columns. For instance:

$$B = A(:, 1\!:\!2) = \begin{bmatrix} 1.0 + i4.0 & 3.0 + i12.0 \\ 2.0 - i5.0 & 6.0 - i15.0 \\ 3.0 + i6.0 & 9.0 + i18.0 \end{bmatrix} \tag{A.8}$$

As a sidenote, a number with a decimal point should usually be taken as
a real number. Mathematically speaking, 1 = 1.0. But many compilers treat 1
as an integer and use the decimal point to recognize 1.0 as a floating-point value.
Therefore, all of the code associated with this work and most of the examples use
the decimal point as a clue that the number is a real number or its floating-point
approximation.
2. Conjugation and Transposition


The conjugate of a vector or matrix is simply a vector or matrix whose
entries are the conjugates of the original entries. A superscript C is used to denote
the conjugate of a vector or matrix. For instance, with A as given in A.7,

$$A^C = \begin{bmatrix} 1.0 - i4.0 & 3.0 - i12.0 & 7.0 - i28.0 \\ 2.0 + i5.0 & 6.0 + i15.0 & 14.0 + i35.0 \\ 3.0 - i6.0 & 9.0 - i18.0 & 21.0 - i42.0 \end{bmatrix} \tag{A.9}$$

The transpose of a vector or matrix, denoted with a superscript T, refers to
a transposition of its rows and columns. With $A \in \mathbb{C}^{m \times n}$, the effect of transposition
is that $A(i,j) = A^T(j,i)$ for all i such that $1 \le i \le m$, and all j so that $1 \le j \le n$.
For example, consider the transposition of the matrix A that is found in equation A.7.

$$A^T = \begin{bmatrix} x^T \\ \mu x^T \\ \nu x^T \end{bmatrix} = \begin{bmatrix} 1.0 + i4.0 & 2.0 - i5.0 & 3.0 + i6.0 \\ 3.0 + i12.0 & 6.0 - i15.0 & 9.0 + i18.0 \\ 7.0 + i28.0 & 14.0 - i35.0 & 21.0 + i42.0 \end{bmatrix} \tag{A.10}$$
In this example we see that the columns of a matrix become the rows of its transpose.
This example also demonstrates that when we first transpose, and then stack the
columns of a matrix, we arrive at the transpose of the matrix. In the event that
$A = A^T$, we say that A is symmetric.

The conjugate (or Hermitian) transpose of A is $A^H$. This matrix is the
result of combining the conjugation and transposition operations on A. The following
example shows the Hermitian transpose of A:

$$A^H = \begin{bmatrix} 1.0 - i4.0 & 2.0 + i5.0 & 3.0 - i6.0 \\ 3.0 - i12.0 & 6.0 + i15.0 & 9.0 - i18.0 \\ 7.0 - i28.0 & 14.0 + i35.0 & 21.0 - i42.0 \end{bmatrix} \tag{A.11}$$

If $A = A^H$, we say that “A is Hermitian.” We should never confuse “A is Hermitian”
with “A Hermitian” (the conjugate transpose, $A^H$, of A). [Ref. 33: p. 294]
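The conjugate transpose and the test for a Hermitian matrix can be sketched in a few lines. The following uses the standard C99 complex type rather than the complex.h of Appendix F, and the names N, hermitian_transpose, and is_hermitian are hypothetical choices made for this illustration only. Indices run from 0 in C, so A(1,3) of the text is a[0][2] here.

```c
#include <complex.h>

#define N 3   /* matrix order assumed for this sketch */

/* Form ah = conjugate transpose of a, i.e. ah(j,i) = conj(a(i,j)). */
void hermitian_transpose(double complex a[N][N], double complex ah[N][N])
{
    for (int i = 0; i < N; i++)
        for (int j = 0; j < N; j++)
            ah[j][i] = conj(a[i][j]);
}

/* Return 1 if a equals its own conjugate transpose, otherwise 0. */
int is_hermitian(double complex a[N][N])
{
    for (int i = 0; i < N; i++)
        for (int j = 0; j < N; j++)
            if (a[i][j] != conj(a[j][i]))
                return 0;
    return 1;
}
```

Applied to the matrix of (A.7), hermitian_transpose reproduces the entries of (A.11), and is_hermitian returns 0 because the diagonal of that matrix is not real.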
3. Zeros 


It could be argued that zero is the most important number. In addition to 
its use as a number, zero is also used to represent a vector or matrix in which every 
element is equal to zero. In the (extremely rare) event that the context does not 
clearly indicate the size of a “0-vector” or “0-matrix”, its size will be given explicitly.
In the absence of implied or specified size, 0 should be interpreted as the number 
zero. Additionally, blank space within a matrix usually means that all elements in 


that region are zero. 
4. Special Forms 


a. Axis Vectors 


An axis vector, $e_i$, is simply the $i$th column (or row) of the identity
matrix.


b. Lower Triangular 


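A lower triangular matrix, usually denoted L, has the form

$$L = \begin{bmatrix} \times & & & \\ \times & \times & & \\ \times & \times & \times & \\ \times & \times & \times & \times \end{bmatrix} \tag{A.12}$$

If L has ones on the diagonal, it is called unit lower triangular. Similarly, the upper
triangular matrix U has the form

$$U = \begin{bmatrix} \times & \times & \times & \times \\ & \times & \times & \times \\ & & \times & \times \\ & & & \times \end{bmatrix} \tag{A.13}$$

U is called unit upper triangular if the diagonal elements are all ones. Sometimes
(e.g., Chapter III) such a matrix is called right triangular and denoted R. When the
matrix is not square, the lower and upper triangular ideas are translated to lower and
upper trapezoidal, with the unit trapezoidal matrices having ones on the diagonal.
The following matrices illustrate the different kinds of trapezoidal matrices. The
matrices may be tall and skinny as

$$U = \begin{bmatrix} \times & \times & \times \\ & \times & \times \\ & & \times \\ & & \\ & & \end{bmatrix} \qquad L = \begin{bmatrix} \times & & \\ \times & \times & \\ \times & \times & \times \\ \times & \times & \times \\ \times & \times & \times \end{bmatrix} \tag{A.14}$$

or short and fat

$$U = \begin{bmatrix} \times & \times & \times & \times & \times \\ & \times & \times & \times & \times \\ & & \times & \times & \times \end{bmatrix} \qquad L = \begin{bmatrix} \times & & & & \\ \times & \times & & & \\ \times & \times & \times & & \end{bmatrix} \tag{A.15}$$
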
D. NORMS 


The information below was taken from [Ref. 21: pp. 53-60], so it seems fitting
to begin with a few of Golub and Van Loan’s comments on norms.

Norms serve the same purpose on vector spaces that absolute value does on
the real line: they furnish a measure of distance. More precisely, $\mathbb{R}^n$ together with
a norm on $\mathbb{R}^n$ defines a metric space. Therefore, we have the familiar notions
of neighborhood, open sets, convergence, and continuity when working with vectors
and vector-valued functions.


1. Vector Norms 


a. Definition 


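A vector norm on $\mathbb{R}^n$ is a function $f : \mathbb{R}^n \to \mathbb{R}$ that satisfies the following
properties [Ref. 21: p. 53]:

$$f(x) \ge 0, \quad x \in \mathbb{R}^n, \quad (f(x) = 0 \text{ iff } x = 0) \tag{A.16}$$
$$f(x + y) \le f(x) + f(y), \quad x, y \in \mathbb{R}^n \tag{A.17}$$
$$f(\alpha x) = |\alpha| f(x), \quad \alpha \in \mathbb{R},\ x \in \mathbb{R}^n \tag{A.18}$$

We denote such a function with a double bar notation: $f(x) = \| x \|$.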


b. The p-Norm


Subscripts on the double bar are used to distinguish between various
norms. The most popular example of this is the p-norm, $\| \cdot \|_p$. This norm is
defined by [Ref. 21: p. 53]

$$\| x \|_p = \left( |x_1|^p + \cdots + |x_n|^p \right)^{1/p}, \quad p \ge 1 \tag{A.19}$$

The 2-norm is the one used most frequently in this work, but the 1- and $\infty$-norms
find frequent application in other work. A natural representation of the 2-norm is
the square root of an inner product

$$\| x \|_2 = \left( |x_1|^2 + \cdots + |x_n|^2 \right)^{1/2} = \sqrt{x^T x} \tag{A.20}$$

The 2-norm of x is the Euclidean length of the vector x.




2. Matrix Norms 


a. Definition 
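A matrix norm on $\mathbb{R}^{m \times n}$ is a function $f : \mathbb{R}^{m \times n} \to \mathbb{R}$ that satisfies
properties similar to those presented in the vector case [Ref. 21: p. 56]:

$$f(A) \ge 0, \quad A \in \mathbb{R}^{m \times n}, \quad (f(A) = 0 \text{ iff } A = 0) \tag{A.21}$$
$$f(A + B) \le f(A) + f(B), \quad A, B \in \mathbb{R}^{m \times n} \tag{A.22}$$
$$f(\alpha A) = |\alpha| f(A), \quad \alpha \in \mathbb{R},\ A \in \mathbb{R}^{m \times n} \tag{A.23}$$

Matrix norms also use the double bar notation: $f(A) = \| A \|$. The Frobenius norm
and the p-norms are the most common matrix norms.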


b. Frobenius


The Frobenius norm is defined as

$$\| A \|_F = \sqrt{\sum_{i=1}^{m} \sum_{j=1}^{n} |a_{ij}|^2} \tag{A.24}$$
c. p-Norms
The p-norm of a matrix, A, is defined by

$$\| A \|_p = \sup_{x \ne 0} \frac{\| Ax \|_p}{\| x \|_p} \tag{A.25}$$


E. LINEAR SYSTEMS


One of the fundamental tasks of linear algebra is to form a matrix representation
of a system of linear equations. Consider the system of linear equations:

$$\begin{aligned} 2u_1 + 3u_2 - 4u_3 &= 7 \\ 3u_1 - 5u_2 + 7u_3 &= \ldots \\ 4u_1 + 6u_2 - 2u_3 &= 1 \end{aligned} \tag{A.26}$$

This system of equations can be expressed using the matrix notation Au = b

$$Au = \begin{bmatrix} 2 & 3 & -4 \\ 3 & -5 & 7 \\ 4 & 6 & -2 \end{bmatrix} \begin{bmatrix} u_1 \\ u_2 \\ u_3 \end{bmatrix} = \begin{bmatrix} 7 \\ \ldots \\ 1 \end{bmatrix} = b \tag{A.27}$$


F. MEASURES OF COMPLEXITY 


The first, and most rudimentary requirement for an algorithm is that it produce 
the correct answer. This seems utterly obvious, but it must never be lost in the 
algorithm designer’s pursuit of the next most important elements—efficiency in using 
time and space. For the moment, we shall assume that the algorithm arrives at an
acceptable answer. Then the algorithm’s use of time and space becomes a very 
serious subject. Knuth provides the notation in [Ref. 34]. 

The time complexity of an algorithm, also known as running time, describes how
the program works under a stopwatch. Space complexity is the amount of temporary
storage required to carry out the algorithm. For example, suppose a person stood at
a chalkboard, ready to solve a problem. In the space complexity of the problem we
would count only the space required on the chalkboard, not the storage for the input
or the output. Usually we like to link the idea of complexity to the input size of the
problem, n. The following discussion of time complexity outlines a few tools that
are standard in the study of algorithms. The same tools and ideas apply for space
complexity analysis. [Ref. 35: pp. 42-43]

The most common method for describing the time complexity of an algorithm
is the “big-Oh” notation [Ref. 35: p. 39].² A function g(n) is O(f(n)) if there exist
constants c and N so that, for all $n \ge N$, $g(n) \le c f(n)$.

$$g(n) = O(f(n)) \Longrightarrow g(n) \le c f(n), \quad n \ge N \tag{A.28}$$

This means that for a large enough problem size n, the running time g(n) is at most a
constant multiple of some function, f(n). Big-Oh notation does not mean a least
upper bound, only an upper bound for n sufficiently large. Practically, O(f(n)) must
be augmented so that we may determine how tightly cf(n) bounds g(n).


² O(f(n)) is read “order f(n).”

By adding a lower bound to big-Oh, we may arrive at a more informative
statement concerning an algorithm’s complexity. This is achieved through the use of
“big Omega”. $T(n) = \Omega(g(n))$ means that there exist constants c and N such that,
for all $n \ge N$, the number of steps T(n) required to solve the problem for input size
n is at least cg(n).

$$T(n) = \Omega(g(n)) \Longrightarrow T(n) \ge c\,g(n), \quad n \ge N \tag{A.29}$$

This is essentially a lower bound on time complexity. If a function f(n) satisfies
both $f(n) = O(g(n))$ and $f(n) = \Omega(g(n))$ (not necessarily using the same constants
c and N for both O and $\Omega$), then we say that $f(n) = \Theta(g(n))$. [Ref. 35: p. 41]

$$f(n) = O(g(n)) \text{ and } f(n) = \Omega(g(n)) \Longrightarrow f(n) = \Theta(g(n)), \quad n \ge N \tag{A.30}$$


Now and then, notation similar to O and $\Omega$ is required except that a strict inequality
is desired. In this case, we use “little oh” and “little omega”. The definitions are:

$$f(n) = o(g(n)) \Longrightarrow \lim_{n \to \infty} \frac{f(n)}{g(n)} = 0 \Longleftrightarrow g(n) = \omega(f(n)) \tag{A.31}$$

We have seen that O, $\Omega$, $\Theta$, o, and $\omega$ are roughly equivalent to the inequalities
$\le$, $\ge$, $=$, $<$, and $>$, respectively. Is this notation meaningful? Does it have utility in
problem solving? The answer is a guarded “yes.” We must understand the purpose
of the notation. It cannot substitute for timing data taken from the actual execution
of an algorithm. It is intended as a good first estimate. There are too many variables
involved in modern tools and machinery to expect accurate analysis from other than
actual execution.


TABLE A.1: ALGORITHM COMPLEXITY AND MACHINE SPEED 


Algorithm Execution Time (in Seconds) for Machine Speed 
S00 steps] 
0.01 0.005 0.001 
] 0.5 0.125 


10 4) 1.25 


32 16 4 


1,000 500 250 125 
1,000,000 500,000 250,000 125,000 
10° 10° 10° 10°" 





Nevertheless, a rough estimate of how a problem grows is important to the problem
solving process. Indeed, experimental results and complexity analysis should not
usually be considered independently, but compared and used as complementary
instruments. The time complexity of an algorithm is, in a sense, more important than
the speed of the machine upon which it is executed. Consider the data in Table A.1
(adapted from [Ref. 35: p. 41]). This is based upon a problem of size n = 1000 and
demonstrates the ability of an algorithm to dominate a machine. For this reason,
and with these conditions clearly established, we will find many occasions to use
time- and space-complexity notation.
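The arithmetic behind a table of this kind is simply the step count divided by the machine speed. The sketch below is a hedged illustration of that calculation: the function names are hypothetical, and the machine speeds of 1000 and 8000 steps per second used in the note that follows are assumptions chosen for the example rather than values read from Table A.1.

```c
/* Seconds needed to run `steps` basic steps on a machine that
   executes `steps_per_second` of them every second. */
double run_time(double steps, double steps_per_second)
{
    return steps / steps_per_second;
}

/* Step counts for three common growth rates at problem size n. */
double steps_linear(double n)    { return n; }
double steps_quadratic(double n) { return n * n; }
double steps_cubic(double n)     { return n * n * n; }
```

With n = 1000, a linear algorithm on the assumed slow machine (1000 steps per second) finishes in 1 second, while a quadratic algorithm on a machine eight times faster still needs 125 seconds, and a cubic algorithm on that faster machine needs 125,000 seconds. An eightfold gain in hardware speed does not compensate for a worse growth rate.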

Finally, the two most common performance measures for parallel computing
are speedup and efficiency. Suppose that $T_n$ is the time of execution for a particular
algorithm, A, on n processors. Consider the best uniprocessor time $T_1$ for a sequential
version of A compared to the execution of an equivalent (not necessarily the same)
parallel program on P processors that executes in time $T_P$. Then speedup, $S_P$, is
defined as

$$S_P = \frac{T_1}{T_P}$$

and the efficiency, $E_P$, is defined to be

$$E_P = \frac{S_P}{P}$$
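These two definitions translate directly into code. The following minimal sketch uses hypothetical function names and returns the efficiency as a fraction; multiply by 100 to obtain the percentages reported in the tables of Chapter VI.

```c
/* Speedup S_P = T_1 / T_P: best uniprocessor time over parallel time. */
double speedup(double t1, double tp)
{
    return t1 / tp;
}

/* Efficiency E_P = S_P / P, returned as a fraction between 0 and 1
   when the speedup does not exceed the processor count. */
double efficiency(double t1, double tp, int p)
{
    return speedup(t1, tp) / p;
}
```

As an example with made-up timings, a program that needs 8 seconds on one processor and 2 seconds on eight processors has a speedup of 4 and an efficiency of 0.5, i.e., 50 percent.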
APPENDIX B
EQUIPMENT


A transputer is a microcomputer with its own local memory and with links 
for connecting one transputer to another transputer. 

The transputer architecture defines a family of programmable VLSI components.
The definition of the architecture falls naturally into the logical aspects
which define how a system of interconnected transputers is designed and programmed,
and the physical aspects which define how transputers, as VLSI components,
are interconnected and controlled.

A typical member of the transputer product family is a single chip containing 
processor, memory, and communication links which provide point to point con- 
nection between transputers. In addition, each transputer product contains special 
circuitry and interfaces adapting it to a particular use. For example, a peripheral 
control transputer, such as a graphics or disk controller, has interfaces tailored to 
the requirements of a specific device. 

A transputer can be used in a single processor system or in networks to build 
high performance concurrent systems. A network of transputers and peripheral 
controllers is easily constructed using point-to-point communication. 


— INMOS 


This introduction is provided by the transputer’s maker in [Ref. 36: p. 7]. 
A. TRANSPUTER MODULES 


INMOS makes a wide variety of microprocessors to suit differing needs. To 
provide a simple, modular interface they have developed the notion of a transputer 
module (TRAM). The TRAM is a small board containing the microprocessor, RAM, 


other circuitry, and a standard sixteen signal interface. 


B. THE IMS B012


Most of the later experiments were carried out on an IMS B012 board. This
board accommodates sixteen transputers, each of which is installed on its own IMS
B401 TRAM. In our case the TRAM holds 32 kilobytes of memory (in addition to
the four kilobytes onboard the T800-20 transputer).


d. INMOS Transputers 


The INMOS transputer gives the system designer a tremendous amount 
of latitude. With these processors—perhaps more than with any other parallel 
architecture—one should give careful thought to the size, component processors, and 
interconnection topology as the first elements in designing a solution to a problem. 
This cannot be overemphasized. When the hardware is not “general purpose” in na- 
ture, it must receive thoughtful consideration along the path to solving the problem. 
Some of the largest applications for parallel machines—especially for transputers— 
are embedded systems. 

An embedded computer system is defined as “one that forms a part of 
a larger system whose purpose is not primarily computational.” [Ref. 37: pp. 15-16] 
To automatically accept or assume a particular machine configuration is to relinquish 
control of one of the tools available in system design. 

Transputer is the name given to the members of a family of microproces- 
sors. While INMOS is the largest producer of these processors, they have not chosen 
to protect the name transputer with any sort of trademark. The name comes from 
a combination of “transistor computer” and each transputer is essentially a com- 
puter on a chip. The chip possesses an arithmetic logic unit (ALU), memory, and a 
communication system that supports bidirectional serial communication links. Most 
of the transputers used for this research also include a 64-bit (IEEE 754 standard) 
floating-point unit (FPU). 

The transputer module (TRAM) is the most common package for transputers.
The capabilities of these modules are quite diverse, but they hold to a
standard interface design. This makes the TRAM easy to use. Systems designed
around TRAMs enjoy simple replacement of components, ease of modification, and
great scalability. Indeed, the laboratory environment in which these TRAMs were
exercised is a very dynamic one.

The PARCDS laboratory has six 80286-based IBM-compatible personal
computers, each of which contains a transputer interface board. Five hold IMS B004
boards and one holds a Transtech TMB08 board. The B004 boards each have two
megabytes of memory and an IMS T414 transputer in addition to the requisite
serial-to-parallel converter and interface circuits. The TMB08 holds four megabytes
of memory and an IMS T800-20 transputer. These “host” machines can each be
connected to an arbitrarily large network of transputers.

For this purpose, we have two INMOS Transputer Evaluation Module
(ITEM) boxes. These boxes can hold at least ten boards of the Double Eurocard size
(approximately 22 cm x 23.5 cm). Of primary interest for this thesis was the IMS
B012 board, a motherboard capable of supporting sixteen TRAMs. For this research,
all sixteen slots were filled with a TRAM that held an IMS T800-20 transputer and
32 kilobytes of TRAM memory (in addition to the transputer’s four kilobytes). The
shortage of memory is probably the greatest deficiency and indicator of the outdated
nature of these processors. TRAMs with four and eight megabytes of memory and
IMS T805-25 transputers are currently available for less than $900.00 and $1,300.00,
respectively.


e. Intel iPSC/2 


The iPSC/2 used for this research contained eight node processors of
the “CX” type (80386/80387 combination). Like the transputers, this machine is
somewhat dated. Today’s i860 chips have considerably more capacity.


C. SWITCHING METHODS


The iPSC/2 and transputer hardware use different switching methods. Intel
uses a circuit switching approach, whereas the INMOS approach is store-and-forward 
switching. Each approach has advantages and disadvantages. The circuit switching 
approach is “almost universally used for telephone networks.” [Ref. 38: p. 12] The 
idea is to first define a path (close a circuit) from the source to the destination and 
then use it as a dedicated line. 

This requires a start-up overhead that depends entirely upon the current load 
being handled by the system. If any part of the medium (links or switches) between 
the source and destination is busy, the message will wait at the source until the 
entire path is clear. The path is determined (in the iPSC/2 case) in a deterministic 
fashion, so that a message from node 2 to node 7 will always insist on a particular 
path, even if some other communication is blocking that path. As the path becomes 
clear, switches between the source and destination are set so that a dedicated line 
will exist from source to destination. 

After the overhead of establishing (closing) the circuit has been paid, commu- 
nication proceeds at a rapid rate. The intermediate nodes along the path do not 
store the message. Instead, their switches have been set so that the message flows 
through. Intuitively, this approach should be quite effective in a network with a very 
structured interconnection topology and a relatively small number of nodes. The 
hypercube gives us this structure. Hypercubes of order three or four are probably 
small enough to avoid difficulties that might arise as many nodes contend for the 
same medium. 

The store-and-forward approach does not require the availability of the entire
path between source and destination nodes. Instead, each node along the path
accepts the entire message in turn and then forwards it to the next node in the path.
This requires the use of no more than one link at a time. For a many-node
environment (particularly if there is little structure or the potential of dynamic
routing), this approach would seem to offer some advantages over the circuit
switching approach.
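The trade-off between the two switching methods can be made concrete with a rough cost model. The sketch below is an illustration only: the structure `link_model`, its two parameters, and both formulas are assumptions of this example (a common first-order model), not measurements or code from this research. The model deliberately ignores contention, which the discussion above identifies as the main risk for circuit switching.

```c
#include <assert.h>

/* Hypothetical first-order cost model; the parameters are illustrative
 * placeholders, not measured values for the iPSC/2 or the transputers. */
typedef struct {
    double setup_per_hop;   /* seconds to set one switch or start one link */
    double bytes_per_sec;   /* link bandwidth                              */
} link_model;

/* Circuit switching: pay the setup at every hop, then stream the message
 * once over the dedicated path. */
double circuit_time(link_model m, int hops, double bytes)
{
    assert(hops >= 1 && bytes >= 0.0);
    return hops * m.setup_per_hop + bytes / m.bytes_per_sec;
}

/* Store-and-forward: every hop receives the whole message before it
 * passes the message on, so the transfer time is paid at each hop. */
double store_forward_time(link_model m, int hops, double bytes)
{
    assert(hops >= 1 && bytes >= 0.0);
    return hops * (m.setup_per_hop + bytes / m.bytes_per_sec);
}
```

Under these assumptions the two estimates coincide for a single hop. As the hop count grows, the store-and-forward estimate grows with the product of hops and message length, while the circuit estimate adds only the per-hop setup.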

The routing criterion is separate from the type of switching used. Either of
the two general approaches described above can support many forms of routing.
Deterministic approaches alone include many methods. For the hypercube topology
with Gray-coded node labels, it is probably useful to combine the Gray code with
the notion of Hamming distance to arrive at a shortest path route. Even with this
approach, there are as many optimum paths between two nodes i and j as the
Hamming distance, H(i, j), between them [Ref. 39: p. 7]. If a dynamic scheme
is used to determine the path, there are even more combinations of potential paths
from i to j. Usually a dynamic approach considers media utilization, “hot spot”
avoidance, and so on.
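One deterministic shortest-path rule can be sketched as follows. The code compares the current node label with the destination label, flips the lowest-order differing bit, and repeats, so a message makes exactly H(i, j) hops. The function names are hypothetical, and the choice to correct the lowest-order bit first is an assumption made for illustration; it is not claimed to be the order used by the iPSC/2.

```c
#include <assert.h>

/* Next node on a deterministic shortest path: flip the lowest-order bit
 * in which the current label and the destination label differ. */
unsigned next_hop(unsigned here, unsigned dest)
{
    unsigned diff = here ^ dest;
    assert(diff != 0);                       /* already there: no next hop */
    return here ^ (diff & (~diff + 1u));     /* isolate the lowest set bit */
}

/* Record the nodes visited after leaving src in path[] and return the
 * hop count. max_len is the capacity of path[]. */
unsigned route(unsigned src, unsigned dest, unsigned path[], unsigned max_len)
{
    unsigned hops = 0;
    unsigned here = src;

    while (here != dest) {
        assert(hops < max_len);
        here = next_hop(here, dest);
        path[hops++] = here;
    }
    return hops;
}
```

With this rule a message from node 2 (010) to node 7 (111) always travels through node 3 (011), which mirrors the fixed-path behavior described for deterministic routing.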




APPENDIX C 
INTERCONNECTION TOPOLOGIES 


Multiprocessor computing brings with it a fundamental concern: interproces- 
sor communication. Communication is—to any designer of computing machinery 
or software—a burden and hindrance. An interconnection topology describes the 
network that handles this load. The hypercube is one of the many topologies used 
in multiprocessor computing. It has been the subject of both hype and criticism. 
Nevertheless, this particular scheme possesses the qualities that quickly draw the 
attention of mathematicians and parallel programmers. The hypercube’s struc- 
ture and simplicity make it dependable and predictable. The same properties that 
enable the hypercube to endure the rigor of mathematical proof lead to practi- 
cal solutions in parallel programming. This discussion describes the hypercube 
topology and explores some of the qualities that make it a practical choice for
multiprocessor computing. 


A. A FAMILIAR SETTING 


Organizing processors into a suitable topology is analogous to the familiar prob- 
lem of organizing personnel into groups. An independent worker has limited capacity, 
so we often set more hands (or machinery) to the task for productivity’s sake. Groups 
of people are often less efficient. Efficiency is a ratio of time spent doing useful work 
to the total time spent. Other metrics might work, but time is universally recog- 
nized as the standard against which productivity is measured. Dependence upon 
others requires communication and consumes time. The loss may be mini- 
mized, but not avoided. Any group working toward a common goal must deal with 
this problem. To be efficient, an organization must possess structure and media for 
communication. 

People spend time on meetings, paperwork, and peripheral pursuits—all for 
the sake of an organization that hopes to outperform the individual. Organizations 


typically perform tasks that are simply impossible for an individual. To be sure, an 
individual often possesses the independence and efficiency that make him the proper
choice. There are tasks that seem to fit one or the other and—while there is some 
crossover in ability—we aren’t likely to get rid of either organizations or individual 
workers soon! This is worth considerable attention. Individuals and organizations 
are chosen for different tasks. 

These ideas apply in the world of parallel processing. First, there are many 
tasks. Some fit nicely onto a single processor. Others beg a parallel solution. Finally, 
some have natural solutions by either method. Even when one of these options is
selected, there are many ways to solve the problem. If a multiprocessor is used to 
solve the problem, the issue of communications will be unavoidable. 

An interconnection topology must carry the burden of interprocessor communi- 
cations. There are many schemes for handling this mission. This discussion focuses 
on one design that fulfills that mission: the hypercube. To forestall confusion: the 


subject is an interconnection topology, not a particular vendor’s product. 
B. APPEAL TO INTUITION 


Productivity can suffer when the members of an organization communicate 


excessively. A lack of communication can also reduce efficiency. In a network of 


if there is a shortage of links, but with too many links a message could get delayed 
or lost in the confusion. The hypercube attempts to strike a balance. 

Hypercubes come in different sizes. In fact, scalability is a key characteristic of 
the hypercube. It allows the designer to tailor a network to a problem. There are 
several ways to express the cube’s size: order is one measure. The term “hypercube 
of order n” (usually called an n-cube) is filled with meaning. A more detailed de- 
scription is given later, but pictures provide the most direct introduction. Figure C.1 


shows hypercubes of order n where n ∈ {0, 1, 2, 3}.




Figure C.1: The Four Smallest Hypercubes (Order 0, Order 1, Order 2, Order 3)


This illustration is important. The hypercube shows geometry, structure, and
symmetry. A few observations nearly jump out of the pictures. One can see several 
terms of a geometric series developing. There is also a recurrence relation at work 
in the building of hypercubes. Intuition suggests the use of well-oiled mathematical 


tools to analyze the hypercube. 
C. TOOLS 


Many benefits may be derived from a few definitions, conventions, and tools 
(that suit the hypercube’s structure). Figure C.2 demonstrates the utility of Carte- 
sian coordinates in n-dimensional space. 

The picture is deceptively simple, but worth careful study. Figure C.2 shows a
unit cube in three dimensions. The vertex labels express (xyz) position in the
coordinate system. The labels also form a binary (Gray) code that is equivalent
to coordinate labeling of a cube in n-dimensional space. The issue of
communications invoked this discussion, so distance must be addressed. A comparison
of the binary labels of any two nodes reveals that the distance between the nodes is
equal to the number of bits that differ in the labels. This measure, called Hamming
distance, and the Gray code are presented in more detail later.


Figure C.2: Cartesian Coordinates for a 3-Cube

This brief introduction is just enough to embark upon a more precise descrip- 
tion of the hypercube. The ideas of a coordinate system, node labeling, and distance 
are fundamental. Graph theory also finds application in topology design. In the hy- 
percube these four tools complement each other nicely. Despite their simplicity they 
can be explored in almost endless detail, even within the constraints of hypercube 


structure. 


D. DESCRIBING THE HYPERCUBE 


The hypercube interconnection topology cannot be captured in a one-sentence 
definition. A definition is often inappropriate for material objects. A description 
given from several perspectives may be more useful. This is the case with topologies. 
Each tool introduced above has its own utility. In a sense, each takes up a particular 
perspective. A meaningful characterization of the hypercube can be achieved by 
combining these perspectives. 

The geometric view is most useful for visualizing the cubes. Despite its ten- 
dency to break down (with three-dimensional limitations), geometry’s intuitive ap- 
peal is indispensable. Geometry and pictures lay the foundation for the setting of 
an undirected graph. Figures C.1 and C.2 take advantage of geometry, but three- 
dimensional sketches begin to lose their appeal as order increases. Nevertheless, 
geometry and visual models hold an important place in describing the hypercube. 
They furnish us with (a) examples for comparison, and (b) expectations that are 
useful in the transition to a more general description of the topology. 

A hypercube of order n may be described as a set of $2^n$ points (vertices, nodes,
or processors) connected by a set of edges. The points are each given an n-bit
binary label, $b_n \ldots b_3 b_2 b_1$. Thus the hypercube’s node labels exhaust all
possible n-bit binary combinations. Furthermore, the labeling convention used in
Figure C.2 describes the point’s n-dimensional Cartesian coordinates.

The hypercube edge set (communication links) includes an edge between every
pair of points $p_i$ and $p_j$ whose binary labels differ in exactly one bit position,
say $b_k$. That is, adjacent nodes have a Hamming distance of one. This measure of
distance proves especially convenient in the hypercube, and it can be thought of in
several equivalent ways. A first definition of Hamming distance is the number of bits
that differ in the two labels. Equivalently, it is the number of 1’s in a bitwise
exclusive or (XOR) of the numbers. Figure C.2 contains an example. Let $p_i$ be the
point labeled 100 and $p_j$ be 110. The binary labels differ in exactly one bit
position, namely $b_2$ (the second bit). The points are neighbors (one hop from each
other in communications terms). [Ref. 40]
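The XOR definition translates directly into a few lines of C. The following is a minimal sketch; the function names are hypothetical, and labels are assumed to fit in an `unsigned` with bit 1 taken as the rightmost bit.

```c
#include <assert.h>

/* Hamming distance: count the 1 bits in the XOR of the two labels. */
unsigned hamming_distance(unsigned a, unsigned b)
{
    unsigned diff = a ^ b;
    unsigned count = 0;

    while (diff != 0) {
        count += diff & 1u;
        diff >>= 1;
    }
    return count;
}

/* Two nodes share a link exactly when their labels differ in one bit. */
int are_neighbors(unsigned a, unsigned b)
{
    return hamming_distance(a, b) == 1;
}

/* For neighbors, report which bit differs (1 = rightmost bit). */
unsigned differing_bit(unsigned a, unsigned b)
{
    unsigned diff = a ^ b;
    unsigned position = 1;

    assert(are_neighbors(a, b));
    while ((diff & 1u) == 0) {
        diff >>= 1;
        position++;
    }
    return position;
}
```

For the example from Figure C.2, the labels 100 and 110 (4 and 6 in decimal) have a Hamming distance of one, and `differing_bit` reports the second bit.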

Despite the appeal of the geometric approach, it holds limited value in a gen- 
eral n-dimensional space. Consider n = 4 in three dimensions. Typical illustrations 
show the sixteen-node cube as a cube inside a cube with connections between corre- 
sponding nodes of the inner and outer cubes. An equivalent diagram would display 
two 3-cubes side-by-side with connections to corresponding nodes. Nevertheless, it 
seems that an n-dimensional coordinate system is the most convenient environment 


for sketching the hypercube of order n. 
E. GREATER DIMENSIONS 


Three-dimensional sketches become difficult to manage. The time comes for a 
change of method. Some of the finest tools available for spanning such a gap are 
recurrence relations and the principle of mathematical induction. The approach is 
not extremely formal, but those so inclined will not find it hard to add the formalities. 

Induction can be used to generate a Gray code suitable for labeling the nodes 
of a hypercube. This code and the Hamming distance can be used to determine 
the cube. The first topic is a procedural description of how to build hypercubes. A 
Gray code construction procedure will follow. If the two topics appear similar, it is 
because they are completely equivalent (assuming that the Gray code is combined 
with the concept of Hamming distance). 

Constructing a hypercube of order zero is trivial. This is not important except
that it leads to greater things (i.e., it is the basis for induction). Second, suppose
that this hypothesis for induction is true: “we know how to construct any hypercube
of order k where $0 \le k < n$”. Induction forms a hypercube of order n using this
base case and hypothesis. This can be done in three steps:


• Replicate the hypercube of order (n - 1) so that there are two identical copies.
  For concreteness, one will be copy number 0 and the other will be copy number
  1. The hypercubes have $2^{n-1}$ nodes each.

• Prepend the copy number to the existing node labels. That is, place a leading 0
  in front of the labels for each node of copy 0 and place a 1 in front of every node
  label in copy 1. Now every node in one copy has a corresponding node in the
  other copy. These corresponding nodes are separated by a Hamming distance
  of one. That is, the last (n - 1) bits are the same for corresponding nodes and
  they differ only in the prepended copy number.

• Connect all nodes whose labels differ only in the prepended copy number. This
  adds $2^{n-1}$ edges between the two copies.
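The three steps above can be sketched as a loop that grows an edge list one order at a time. This is a hedged illustration rather than code from this research: the `edge` type and the function `build_hypercube` are hypothetical names, and the array is sized with the edge count n·2^(n-1), a figure assumed here for allocation only.

```c
#include <assert.h>
#include <stdlib.h>

typedef struct { unsigned a, b; } edge;   /* one link between two labels */

/* Build the edge list of an n-cube by repeating the steps
 * replicate / prepend copy number / connect.  Returns a malloc'ed array
 * (the caller frees it) and stores the number of edges in *count. */
edge *build_hypercube(unsigned n, size_t *count)
{
    size_t total, used = 0, e, v;
    unsigned k;
    edge *edges;

    assert(n < 16);                        /* keep shifts and storage small */
    total = (n == 0) ? 0 : ((size_t)n << (n - 1));   /* n * 2^(n-1) edges  */

    edges = malloc((total ? total : 1) * sizeof *edges);
    if (edges == NULL) { *count = 0; return NULL; }

    for (k = 1; k <= n; k++) {
        unsigned top = 1u << (k - 1);              /* the prepended bit    */
        size_t old_edges = used;                   /* edges of order k-1   */
        size_t old_nodes = (size_t)1 << (k - 1);   /* nodes of order k-1   */

        /* Steps 1 and 2: copy 1 is copy 0 with a leading 1 on each label
         * (the leading 0 of copy 0 is implicit). */
        for (e = 0; e < old_edges; e++) {
            edges[used].a = edges[e].a | top;
            edges[used].b = edges[e].b | top;
            used++;
        }
        /* Step 3: connect corresponding nodes of the two copies. */
        for (v = 0; v < old_nodes; v++) {
            edges[used].a = (unsigned)v;
            edges[used].b = (unsigned)v | top;
            used++;
        }
    }
    assert(used == total);
    *count = used;
    return edges;
}
```

Every edge produced this way joins two labels that differ in exactly one bit, which is the adjacency rule stated earlier.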
F. GRAY CODE GENERATION


The procedure above generates hypercubes. By focusing on the vertex labels,
Gray code generation can be discussed. A Gray code is a cyclic list of all of the n-bit
numbers in which only one bit changes from one number to the next [Ref. 40]. Since
the code is binary, there are $2^n$ numbers in the list. The starting point is arbitrary
(the list is cyclic), but I have started with zero. Perhaps the best explanation of Gray
codes comes in the construction of one. As in the construction of hypercubes, a base
case is required to begin generation.


• Start with 0. This is a one-bit number (n = 1), so the one-bit Gray code must
  have a total of $2^1 = 2$ numbers. The other is 1. Next, the hypercube building
  steps established above are applied with slight modification.

• Given the one-bit case, it is easy to generate the n = 2 code. Write down the
  previous code and draw a line below it. Next, form a copy by reflecting the code
  downward across the line. Place a zero in front of each number in the previous
  code (above the line), and a one in front of each number in the new copy (below
  the line).

• This is a Gray code for n = 2. Table C.1 extends the idea. The list is cyclic,
  each number consists of n bits, and the list contains all $2^n$ possible numbers. To
  construct the code for larger n, the process may be applied repetitively. Copy
  by reflecting the (n - 1)-bit code downward across a line, prepend a zero to
  everything above the (most recent) line, and prepend a one to those below that
  line.


TABLE C.1: GRAY CODE GENERATION
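The reflect-and-prepend steps can be sketched in C as follows. This is an illustration of the construction just described, not the gray.c program of this appendix; the function name `gray_reflect` is hypothetical, and the caller is assumed to supply an array with room for $2^n$ entries.

```c
#include <assert.h>
#include <stddef.h>

/* Fill code[0 .. 2^n - 1] with the n-bit reflected Gray code. */
void gray_reflect(unsigned n, unsigned code[])
{
    size_t len = 1, i;
    unsigned bits;

    assert(n >= 1 && n < 16);
    code[0] = 0;                           /* start the list with zero     */

    for (bits = 1; bits <= n; bits++) {
        unsigned one = 1u << (bits - 1);   /* the bit being prepended      */

        /* Reflect the current list downward across the "line" and put a
         * 1 in front of each reflected entry.  Entries above the line
         * keep their value: the prepended 0 is implicit. */
        for (i = 0; i < len; i++)
            code[2 * len - 1 - i] = code[i] | one;

        len *= 2;
    }
}
```

For n = 3 this yields 000, 001, 011, 010, 110, 111, 101, 100. Successive entries differ in one bit, and so do the last and first entries, which is the cyclic property noted above.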


The Gray code is probably the most useful node labeling to attach to the
hypercube. This code often appears in implementation. The program listing in
Section H (gray.c) shows one way to generate the code. It can be used, for instance,
as the backbone of a routing function in a network. Labels with a Hamming distance
of one mark neighbors in the hypercube. What about the labels of two nodes that
differ in exactly k bits (i.e., have a Hamming distance of k)? It turns out that k is
the distance (number of edges) between these nodes. For all communications between
these nodes, the shortest path will involve k hops.

This also indicates that, for an n-cube, there is no pair of nodes that have 
a Hamming distance of more than n (e.g., communication between nodes 0000010 
and 1111101 in a 7-cube can be achieved in seven hops). The greatest distance 
across the n-cube is n hops. In fact, for each node in a hypercube, there is a unique 
corresponding node at a Hamming distance of n. Also, there are n nodes at a 
Hamming distance of one from each of the hypercube’s nodes. 
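Both observations follow from simple bit operations, as the sketch below suggests. The function names are hypothetical, and labels are assumed to be stored in an `unsigned` with n below 16.

```c
#include <assert.h>

/* Write the n neighbors of `node` into out[0 .. n-1] by flipping each of
 * the n label bits once. */
void neighbors(unsigned node, unsigned n, unsigned out[])
{
    unsigned k;

    assert(n >= 1 && n < 16 && node < (1u << n));
    for (k = 0; k < n; k++)
        out[k] = node ^ (1u << k);
}

/* The unique node at Hamming distance n: complement all n label bits. */
unsigned farthest_node(unsigned node, unsigned n)
{
    assert(n >= 1 && n < 16 && node < (1u << n));
    return node ^ ((1u << n) - 1u);
}
```

In a 7-cube, for example, `farthest_node` maps node 0000010 to node 1111101, the pair used in the example above.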

Two approaches have been considered so far: sketching cubes in n-dimensional 
Cartesian coordinates and studying the labels associated with the cubes. Though 
the approaches are fundamentally different, they arrived at many of the same conclu- 
sions. Careful application of the Gray code and Hamming distance could produce a 
nearly endless string of results, but it is more convenient to introduce some material 
from the study of graphs at this point. Graph theory combines the two approaches: 
it looks at the pictures and studies the numbers as well. The small hypercubes
described with earlier methods are given graph representation in the illustration of 


Figure C.3. 


G. GRAPHS OF HYPERCUBES 


Graph theory is, of course, much more sophisticated than the small subset 
used here. Buckley and Harary provide a valuable source [Ref. 41]. This discussion 
exposes a few salient features of the hypercube from the perspective of graphs. 

Figure C.3: Hypercube Graphs


A graph, H, consists of a vertex set, V(H), and an edge set, E(H). The vertices,
or nodes, in the multiprocessor network model are the processors. The edges are the
links that connect the processors. I will avoid using the term order in its graph
theory sense (i.e., number of nodes) so that it cannot be confused with the order of
the hypercube. Consider the graph, $H_n$, of a hypercube of order n. The graph has
these characteristics:

• There are $2^n$ nodes. This means that the number of nodes (i.e., processors)
  grows very quickly with order.

• Every vertex, $v_p$, in $H_n$ has eccentricity $e(v_p) = n$. Eccentricity is the
  distance to a node farthest from $v_p$. Additionally, each node in a hypercube has
  exactly one eccentric (farthest) node. This property means that hypercubes are
  unique eccentric node (u.e.n.) graphs.

• The radius of a graph is the minimum eccentricity of the nodes and the diameter
  is the maximum eccentricity. The hypercube is self-centered, meaning its radius
  and diameter are the same: $r(H_n) = d(H_n) = n$. This is significant because it
  says that worst-case communication distances grow only like the order of the
  hypercube.

• Connectivity is a measure of reliability or fault tolerance in multiprocessor
  networks. The connectivity of a hypercube is equal to the order of the cube, n.
  The edge connectivity is also n (each node has n incident edges).
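The eccentricity claim can be checked numerically for small cubes with a breadth-first search. The sketch below is an assumption-laden illustration: the function name, the `MAX_ORDER` limit, and the use of static work arrays are choices made for this example only.

```c
#include <assert.h>

#define MAX_ORDER 10
#define MAX_NODES (1u << MAX_ORDER)

/* Breadth-first search over the n-cube from `start`.  Returns the
 * eccentricity of `start`, i.e. the distance to its farthest node. */
unsigned eccentricity(unsigned start, unsigned n)
{
    static unsigned queue[MAX_NODES];
    static int dist[MAX_NODES];
    unsigned nodes, head = 0, tail = 0, v, k, far = 0;

    assert(n <= MAX_ORDER);
    nodes = 1u << n;
    assert(start < nodes);

    for (v = 0; v < nodes; v++) dist[v] = -1;      /* -1 marks unvisited */
    dist[start] = 0;
    queue[tail++] = start;

    while (head < tail) {
        v = queue[head++];
        if ((unsigned)dist[v] > far) far = (unsigned)dist[v];
        for (k = 0; k < n; k++) {                  /* visit the n neighbors */
            unsigned w = v ^ (1u << k);
            if (dist[w] < 0) {
                dist[w] = dist[v] + 1;
                queue[tail++] = w;
            }
        }
    }
    return far;
}
```

Calling `eccentricity(v, 4)` for each of the sixteen nodes of a 4-cube should return 4 every time, in agreement with the radius and diameter stated above.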


Counting the number of nodes in a hypercube is easy. The hypercube
construction process also points to a recurrence relation that reveals the number of
edges in a hypercube. The initial case, of course, is the hypercube of order zero with
no edges. After this, the number of edges can be expressed in terms of the size of the
previous cube. Suppose a hypercube of order n has q edges. Then the hypercube of
order (n + 1) will have $2q + 2^n$ edges. This is because the construction procedure
calls for two copies and $2^n$ edges between them.

Figure C.4 provides an example. This is the graph, $H_4$, of the hypercube of
order four. All of the characteristics given above are evident. Additionally, a Gray
code labeling of the nodes is given. The recurrence relation above is useful, but it
retains a dependence upon q. A more convenient formula would depend on n alone.

In fact, there is a simple formula for the number of edges in the graph of a
hypercube, but it requires a closer look at the recurrence relation. In more formal
terms: let q(n) represent the number of edges in a hypercube of order n. Then:

$$
q(n) =
\begin{cases}
0, & n = 0 \\
2\,q(n-1) + 2^{n-1}, & n > 0
\end{cases}
$$

This can be expanded and shown equivalent to $q(n) = n\,2^{n-1}$. Table C.2
provides an example.
TABLE C.2: NODES AND EDGES FOR A HYPERCUBE

Order    Number of Nodes    Number of Edges
  0             1           0
  1             2           2(0) + 2^0 = 1
  2             4           2(1) + 2^1 = 4
  3             8           2(4) + 2^2 = 12
  4            16           2(12) + 2^3 = 32
  5            32           2(32) + 2^4 = 80
  6            64           2(80) + 2^5 = 192
  7           128           2(192) + 2^6 = 448
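The agreement between the recurrence and the closed form can also be checked mechanically. The sketch below uses hypothetical function names. A quick way to see why the closed form holds: each of the $2^n$ nodes has n links and each link is shared by two nodes, giving $n\,2^n / 2 = n\,2^{n-1}$ edges.

```c
#include <assert.h>

/* Edge count from the recurrence q(0) = 0, q(n) = 2 q(n-1) + 2^(n-1). */
unsigned long edges_recurrence(unsigned n)
{
    unsigned long q = 0;
    unsigned k;

    assert(n < 26);                 /* result stays below 2^32 */
    for (k = 1; k <= n; k++)
        q = 2 * q + (1UL << (k - 1));
    return q;
}

/* Edge count from the closed form q(n) = n 2^(n-1). */
unsigned long edges_closed_form(unsigned n)
{
    assert(n < 26);
    return (n == 0) ? 0 : ((unsigned long)n << (n - 1));
}
```

Both functions return 448 for n = 7, matching the last row of Table C.2.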







Figure C.4: Graph of a 4-Cube


H. SOURCE CODE LISTINGS 


A listing of the Gray code generation program gray.c follows. 


/* ======================= PROGRAM INFORMATION =======================
 *
 *  SOURCE     : gray.c
 *  VERSION    : a2
 *  DATE       : 01 August 1991
 *  AUTHOR     : Jon Hartman, U. S. Naval Postgraduate School
 *  USAGE      : gray
 *  REFERENCES :
 *  [1] Hamming, Richard W. "Coding and Information Theory", 2nd edition,
 *      Englewood Cliffs, N.J.: Prentice-Hall, 1986, pp. 97-99.
 *
 * =========================== DESCRIPTION ===========================
 *
 *  This program generates and displays the Gray code described in [1].
 *
 * ============================ ALGORITHM ============================
 *
 *  Consider a b-bit Gray code beginning at zero.  Let j be an integral index
 *  such that 0 <= j < b.  Consider two b-vectors, mod_counter[] and bin[].
 *  Each element, mod_counter[j], holds a count mod (2^(j+1)).  Initially we
 *  shall set mod_counter[j] = (2^j).  Furthermore, let the elements of bin[]
 *  represent a binary number in the natural way.  That is, each element,
 *  bin[j] will be either 0 or 1, and bin[] will be formed so that the sum,
 *  ( 2^0 * bin[0] + 2^1 * bin[1] + 2^2 * bin[2] + ... ), represents the
 *  'value' of bin[].  We have elected to start the code at zero, so let
 *  bin[] be set to zeros initially.  Next perform this algorithm:
 *
 *  for (i = 0; i < (2^b); i++) {
 *     Print the "binary number" represented by bin[].
 *     for (j = 0; j < b; j++) {
 *        Let mod_counter[j] = (mod_counter[j] + 1) mod (2^(j+1))
 *        If mod_counter[j] == 0, then toggle the bit in bin[j]
 *        (i.e., bin[j] = (bin[j] XOR 1) ).
 *     } end for(j)
 *  } end for(i)
 */

#include <stdio.h>
#include <stdlib.h>                    /* calloc(), free(), exit() */

#ifndef EXIT_FAILURE
#define EXIT_FAILURE 1
#endif

#ifndef SUCCESS
#define SUCCESS 0
#endif

#define POW2(n)   (1L << (n))
#define MAX_BITS  ((long) (8 * sizeof(long) - 2))   /* largest safe shift */

int main(void) {

   char line[64];          /* one line of keyboard input              */
   int  patience = 5;      /* there's a limit to my patience!         */

   long b = 0,             /* as in b-bit Gray code                   */
        *bin,              /* as described above                      */
        i,                 /* generic integral values                 */
        j,
        l = 0,             /* length of Gray code (2^b)               */
        *mod_counter;      /* as described above                      */

   printf("\n\n\n----==== ");
   printf("This program generates the binary numbers of a Gray code. ");
   printf("====----\n\n\n");

   printf("   Successive numbers in a Gray code differ in exactly ");
   printf("one bit position.\n");

   printf("   The list generated by this program will be complete.  ");
   printf("That is, if you\n");

   printf("   request the code of numbers that are b-bits long, ");
   printf("you will get a list\n");

   printf("   of (2^b) binary numbers, starting with zero.\n\n\n");

   /* The sole purpose of this while() loop is to get the value of b */
   while (b <= 0) {

      printf("   Please enter desired length (binary digits): ");

      /* fflush(stdin) is not portable, so read a whole line and parse it */
      if (!fgets(line, sizeof line, stdin) || sscanf(line, "%ld", &b) != 1)
         b = 0;

      printf("\n\n");
      if (b > MAX_BITS) {       /* guard against too many left shifts! */
         printf("   The acceptable range is ");
         printf("1..%ld.  ", MAX_BITS);
         printf("Please try again.\n\n\n");
         b = -1;
      }
      else if (b > 0) {         /* else ask again (patience permitting) */
         l = POW2(b);
      }

      if ((b <= 0) && (--patience <= 0)) {
         printf("   Ran out of patience!\n");
         exit(EXIT_FAILURE);
      }
   } /* end while (b <= 0) */

   /* Allocate storage for the arrays, test to see if it worked */
   bin         = (long*) calloc ((size_t) b, sizeof(long));
   mod_counter = (long*) calloc ((size_t) b, sizeof(long));

   if ((!bin) || (!mod_counter)) {
      printf("main(): Allocation failure bin[] or mod_counter[].\n");
      exit (EXIT_FAILURE);
   }

   /* Initialize mod_counter[] */
   for (i = 0; i < b; i++) mod_counter[i] = POW2(i);

   printf("   Gray code for %ld bits will generate ", b);
   printf("%ld numbers.\n\n\n", l);

   printf("   Press RETURN to continue....");
   i = getc(stdin);
   printf("\n\n\n");

   /* Do the for() loop spoken of in the "ALGORITHM" section above */
   for (i = 0; i < l; i++) {

      /* Print the binary representation held in bin[] */
      printf("   ");
      for (j = (b - 1); j >= 0; j--) { printf("%ld", bin[j]); }
      printf("\n");

      /* Adjust the counters using addition mod (2^(j+1)) and toggle the
       * corresponding bit in bin[] whenever an element of mod_counter[]
       * reaches zero.
       */
      for (j = 0; j < b; j++) {
         mod_counter[j]++;
         if ((mod_counter[j] %= POW2(j+1)) == 0) bin[j] ^= 1;
      }
   } /* end for(i) */

   free(bin);
   free(mod_counter);

   return(SUCCESS);
}
/* ------------------------- EOF gray.c ------------------------- */
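As a cross-check on the counter scheme, the i-th word it produces can be compared with the well-known identity for the reflected Gray code, g = i XOR (i >> 1). The sketch below is not part of gray.c; the function names and the `MAX_BITS_DEMO` limit are hypothetical, and the counter logic is restated here from the ALGORITHM description.

```c
#include <assert.h>

#define MAX_BITS_DEMO 16

/* i-th reflected Gray code word via the identity g = i XOR (i >> 1). */
unsigned long gray_word(unsigned long i)
{
    return i ^ (i >> 1);
}

/* Word reached after `steps` passes of the mod-counter scheme for a
 * b-bit code: counter j starts at 2^j, counts mod 2^(j+1), and toggles
 * bit j each time it wraps to zero. */
unsigned long gray_by_counters(unsigned b, unsigned long steps)
{
    unsigned long counter[MAX_BITS_DEMO];
    unsigned long word = 0, s;
    unsigned j;

    assert(b >= 1 && b <= MAX_BITS_DEMO);
    for (j = 0; j < b; j++) counter[j] = 1UL << j;

    for (s = 0; s < steps; s++) {
        for (j = 0; j < b; j++) {
            counter[j] = (counter[j] + 1) % (1UL << (j + 1));
            if (counter[j] == 0) word ^= 1UL << j;
        }
    }
    return word;
}
```

For a 4-bit code the two functions agree on every index from 0 to 15, which suggests that the counter scheme produces the same reflected code as the construction in Section F.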
APPENDIX D
A SPARSE MATRIX


Partial differential equations can be used to characterize many physical
problems. Explicit solutions to these problems are often quite complicated, so
alternative approaches warrant our attention. Simple matrices exist as legitimate
representatives of complex problems. A system of linear equations can be constructed
to give a discrete approximation to the problem. The structure of the physical
setting guarantees that the corresponding matrix of coefficients will be sparse and
symmetric. Why does this happen? When do we have the right to expect such a
simple matrix? Where does the matrix come from and what does it mean?

This discussion explains how to construct the matrix of coefficients and
vectors that describe the numerical approximation to an elliptic partial differential
equation. Poisson’s equation in two dimensions is used to demonstrate the process.
The first step uses a finite difference approximation to produce a system of
equations. The system is fine-tuned and the matrix of coefficients is extracted. The
process reveals the origins of structure and shows why the matrix is sparse and
symmetric.


A. LAPLACE AND POISSON


To most engineers, mathematicians, and scientists, Laplace and Poisson are
familiar French names. Pierre-Simon de Laplace (1749-1827) and Siméon Denis
Poisson (1781-1840) made sizeable contributions to several fields. In a moment, the
discussion turns to partial differential equations named in honor of these gentlemen.

If the material seems a bit difficult, the following quote from [Ref. 42: p. 10]
may provide some encouragement. The ideas are not so obvious to everyone as they
may have been to Laplace.

Nathaniel Bowditch (1773-1838), an American astronomer and mathematician,
while translating Laplace’s Mécanique céleste in the early 1800s, stated, “I
never come across one of Laplace’s ‘Thus it plainly appears’ without feeling sure
that I have hours of hard work before me to fill up the chasm and find out and show
how it plainly appears.”

The next several pages are dedicated to showing how the matrix representation
of a partial differential equation plainly appears! The objective is to describe a
particular physical problem, then convert it to the equivalent matrix representation
using a deliberate, step-by-step approach.


B. EQUATIONS 


Laplace and Poisson worked with partial differential equations that can be ob- 
served in nature. What kinds of natural phenomena can be described with partial 
differential equations? This section gives a brief answer to this question. The dis- 
cussion includes the natural setting, the equations, and a quick look at the variables 
and constants involved. The link between the equations and their physical meaning 
is critical, so this aspect must be developed. The heat equation has one of the most 
intuitive physical interpretations available, so it is used as a starting point. After 
developing a general perspective, the field can be narrowed to a particular example— 
Poisson’s equation. Such a limited survey of partial differential equations can only 


hope to succeed by appealing to the reader’s experience and intuition.
1. Heat 


Before looking at a partial differential equation, let us recall some plane 
geometry. The intersection of a plane and a cone(s) provides many interesting shapes 
and equations. Consider the equation that describes all points equidistant from a 
point (focus) and a line (directrix): 


$$ y = \left(\frac{1}{4c}\right) x^2 + k \tag{D.1} $$


This is a parabola whose focus and vertex both lie on the y-axis (the axis of the 


parabola is the y-axis). The focal length is c and the vertex is located at (0, k). 
Partial differential equations are classified using conic sections much like
equations in the xy-plane. Introductions to partial differential equations often begin
with the heat equation:

$$ \frac{\partial u}{\partial t} = \kappa \frac{\partial^2 u}{\partial x^2} + Q(x, t) \tag{D.2} $$

This is an example of a parabolic partial differential equation. Note the similarity of
equations (D.1) and (D.2).


a. Definitions and Notation 


The heat equation describes the temperature, u(x, t), in a “thin rod”
(the single dimension x appears in the equation). The presence of t indicates
dependence upon time. If there is a heat source (or sink) present, it is represented
by Q. We can see that Q may be a function of x or t or both. When mass density (ρ),
specific heat (s), and thermal conductivity (k′) are known, the thermal diffusivity,
κ, can be determined using the following relation:

$$ \kappa = \frac{k'}{s\rho} \tag{D.3} $$


b. Houses and Heat 


From our youth, we have observed several important properties of heat 
flow. The lessons are simple, few in number, and can be observed from the comfort 
of our home. First, heat energy only flows when there is a difference in temperature. 
If the temperature outside is the same as the indoor temperature, no heat energy will 
cross the threshold (even with the door open). A temperature difference represents
an instability and heat will flow to counter this situation. 

When heat does flow, it goes from hotter to colder regions. The loss of 
heat energy from the warmer region reduces the temperature there, and the temperature
in the colder region rises as it gains heat energy. The transfer of heat
has a stabilizing effect (the environment will not be at rest as long as temperature
differences exist). We do not find the changes in temperature surprising, but our 
conversation indicates confusion concerning the direction of the flow. Most of us have 
heard someone say: “Close the door, you’re letting cold air in!”. We understand that 
this statement is not correct, but it seems to persist from one generation to the next. 
In addition to the idea that heat flows in the presence of temperature 
differences (gradients), we clearly understand that larger differences are related to 
greater heat flow. On a very cold winter day, the parent notices more quickly that the
child left the door open (and displays more urgency in shutting it). In other words, 
the effect of heat flow is to balance differences in temperature and it somehow “works 
harder” when there is a greater difference to balance. In mathematical terms, we 
would suspect (correctly) that heat flow is proportional to temperature difference. 
Finally, we recognize an ability to restrict heat’s ever-present balancing 
efforts. Sometimes we want an imbalance in temperature, and we often use insulation 
to maintain this imbalance. When we shut the door, we expect that it will slow 
the transfer of thermal energy through the doorway and enable us to maintain an 
acceptable imbalance in temperature. For the same reason we use special materials 
in the construction of refrigerators to keep heat out, and in ovens to keep heat energy 
inside. This means that the effectiveness of heat transfer is subject to properties of 
the medium (air, glass windows, fiberglass insulation, wood doors, steel, styrofoam, 


and so on) through which it flows. 


c. Heat Flux 


The right-hand side of the heat equation looks a bit complex, but it
merely captures this idea of heat flow. Before tackling the second partial derivative
of u with respect to x, think about the first partial derivative. The first partial
derivative of u with respect to x (scaled by the thermal conductivity, K) describes
movement of thermal energy. This flow of heat is usually called heat flux, denoted
φ, and can be calculated using Fourier’s law of heat conduction:


\[ \phi = -K \frac{\partial u}{\partial x} \tag{D.4} \]


Heat flux is a measure of how much thermal energy per unit time is 
moving to the right per unit surface area (by convention, flow to the left is assigned 
a negative value and flow to the right is positive) [Ref. 43: p. 3]. The second partial 
derivative measures changes in flux with respect to position. In other words, it 


represents increasing or decreasing flux. 


d. Heat Equation Summary 


Let us carefully reassemble the pieces of the heat equation (D.2) to see 
if the theory agrees with experience. Temperature has spatial and temporal depen- 
dencies. The left-hand side describes changes in temperature over time. Changes in 
heat flux are captured in the second partial of u that appears on the right-hand side. 
Flux, heat energy in motion, acts to equalize temperature. The thermal diffusivity, 
κ, measures how readily the material permits heat flux. That is, a temperature difference
activates the flow of heat but the speed and effectiveness of this flow is moderated by 
material properties. Considering everything, then, the heat equation can be stated 
in one (long) sentence: Changes in temperature over time are caused by (equal to, 
due to, related to) changes in heat flow (moderated or accelerated by properties of 


the material) and thermal source(s). 
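
The one-sentence reading of (D.2) can be tried numerically. The following is only a sketch and is not code from this research: it assumes an explicit (forward-time, centered-space) update, a uniform spacing dx, fixed end temperatures, and the usual stability restriction κ·dt/dx² ≤ 1/2. The names heat_step, dx, dt, and q are illustrative.

```python
def heat_step(u, kappa, dx, dt, q=None):
    """Advance the rod temperatures u by one explicit time step of (D.2).

    Assumptions (not from the text): uniform spacing dx, end values held
    fixed, and an optional list q of source values, one per point.
    """
    r = kappa * dt / dx ** 2
    if r > 0.5:
        raise ValueError("explicit update needs kappa*dt/dx**2 <= 0.5")
    new_u = list(u)  # the two end temperatures are copied unchanged
    for i in range(1, len(u) - 1):
        source = q[i] if q is not None else 0.0
        # change in temperature = kappa * (change in flux) + source
        new_u[i] = u[i] + r * (u[i - 1] - 2.0 * u[i] + u[i + 1]) + dt * source
    return new_u


# A rod that is hot at the left end only: heat moves toward the cold side.
rod = [100.0, 0.0, 0.0, 0.0, 0.0]
for _ in range(3):
    rod = heat_step(rod, kappa=1.0, dx=1.0, dt=0.25)
    print(rod)
```

Each pass moves temperature from the hot end into its colder neighbors, and a larger κ (within the stability limit) moves it faster.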
2. Notation 


With two or more dimensions, the same equations that looked simple in one 
dimension can begin to look complex. The linear operator, Δ, is used to simplify
the notation. For example, Δu, substituted into the right-hand side of (D.2), gives
the heat equation a new look:

\[ \frac{\partial u}{\partial t} = \kappa \Delta u + Q(x,t) \tag{D.5} \]


This is a more general equation since the linear operator Δ can be applied in any
number of dimensions. For instance (in three dimensions),

\[ \Delta u = \frac{\partial^2 u}{\partial x^2} + \frac{\partial^2 u}{\partial y^2} + \frac{\partial^2 u}{\partial z^2}. \tag{D.6} \]

Sometimes this operator is called the Laplacian of u and some authors use the del
operator, ∇, in these equations (∇²u = Δu).
3. Diffusion 


The behavior of thermal energy is actually a special instance of diffusion,
so (D.5) is often referred to as the diffusion equation. With an appropriate
substitution for κ, the equation might describe the spreading of dye through ocean water.
In an agricultural application, it could characterize water or chemical penetration
in soil. We shall continue to use the term “heat equation”, though, for the sake of
consistent terminology and notation.
4. Laplace’s Equation 


Consider the effect of a few restrictions on the heat equation. Suppose that
there is no source of thermal energy (Q = 0) and the physical properties of the
material do not vary (κ is constant). Finally, what happens if the time-dependency
is removed?

The left-hand side of the equation goes away. This is not so unrealistic.
Systems may reach a steady (equilibrium) state after a time (especially in the absence
of sources). We can divide through by κ (assuming κ ≠ 0) and the equation becomes:


\[ \Delta u = \frac{\partial^2 u}{\partial x^2} + \frac{\partial^2 u}{\partial y^2} = 0 \tag{D.7} \]

This is Laplace’s equation in the two dimensions x and y. Sometimes it is called
the potential equation since it also describes the cases in which u stands for
gravity or voltage. It can also describe “steady-state heat flow... hydrodynamics,
gravitational attraction, elasticity, and certain motions of incompressible fluids”
[Ref. 44: pp. 660-661].


5. Ellipses 


Although Laplace’s equation seems like a steady-state heat equation, it is
fundamentally different. It falls in the elliptic class of partial differential equations.
Consider an ellipse centered at the origin with foci (on the x-axis at a distance of c
from the origin) located at (-c,0) and (c,0). Suppose that the foci are labeled F_1
and F_2. The major axis passes through the center and through the foci, connecting
two vertices positioned at (-a,0) and (a,0). The minor axis passes through the
center perpendicular to the major axis and connects the vertices at (0,-b) and
(0,b). The major axis deserves its name since a ≥ b (in the case of equality the
ellipse degenerates and we get a special case, the circle).

For any arbitrary point, p, let d_1 be the distance from p to F_1
and let d_2 be the distance from p to F_2. Furthermore, let d = d_1 + d_2. The ellipse
is described by all points satisfying d = 2a, where a is the constant length of the
ellipse’s semi-major axis as described above. The standard form for the equation of
this ellipse is

\[ \frac{x^2}{a^2} + \frac{y^2}{b^2} = 1. \tag{D.8} \]

Using the distances from this ellipse, a right triangle can be formed with sides of
length b and c and hypotenuse of length a. This means a, b, and c are related by the
Pythagorean Theorem.
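
Written out, with a, b, and c as defined above, the relation is the standard one for an ellipse (shown here only as a reminder, not as a numbered equation of this appendix):

```latex
\[ a^2 = b^2 + c^2, \qquad\text{so that}\qquad c = \sqrt{a^2 - b^2}. \]
```

When a = b the focal distance c is zero, both foci sit at the center, and the ellipse is a circle.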





Figure D.1: The Region 


6. Poisson’s Equation 


We have discussed several partial differential equations and observed the
impact of changing a few parameters. Laplace’s equation showed what happens in
the steady-state case when sources are removed and the thermal diffusivity is nonzero.
Now we return to the more general problem that can be represented in the
presence of a source, sometimes called a driving (or forcing) function, say f(x,y).
The result is Poisson’s equation (shown here in two dimensions):

\[ \Delta u = \frac{\partial^2 u}{\partial x^2} + \frac{\partial^2 u}{\partial y^2} = f(x,y). \tag{D.9} \]

Again, u(x,y) typically represents temperature or voltage. Laplace’s equation (D.7)
is just the special case of Poisson’s equation (D.9) where f(x,y) = 0. The rest of
the discussion will focus on Poisson’s equation within the rectangular region (shown
in Figure D.1): 0 < x < L, 0 < y < H.

Figure D.2: Subdividing the Rectangle


7. Final Assumptions 


We shall assume that the conditions along the boundaries are known and are
given by u = g(x,y). The problem is solved in the presence of a forcing function f.
The goal is to produce something that a computing machine can “solve”. To reach
this position, several steps are required. First, the domain is divided into many
smaller regions. Using this subdivision scheme, a system of equations is developed.
The information that is known (f and g) can be moved to the right-hand side of the
system. The system can then be represented in typical Ax = b fashion.
C. DISCRETIZATION 


Before attempting a numerical solution, the domain must be subdivided into a 
finite (but probably large) number of elements. Figure D.2 provides an illustration 
of what this mesh looks like. We should not forget that actual applications may 


involve 100 (or more) divisions in each direction. Nevertheless, (artificially) small
examples are quite sufficient for conveying notation and measures within the region.
1. Notation 


A clear understanding of the problem domain, conventions, and notation 
is prerequisite to developing the system of equations. Consider Figure D.2. This 
domain will serve as a reference for the upcoming discussion on conventions and 
notation. 

The rectangular region has length L = 9 and height H = 5. It has been
subdivided into 45 smaller elements by a mesh made of four horizontal lines and eight
vertical lines. The integers m and n are used to keep track of how many horizontal
and vertical dividing lines are used (here m = 4 and n = 8). Each element has length
h (in the x-direction) and height k (in the y-direction). In this particular example,
the elements are (conveniently) square with h = k = 1. In general, the individual
elements within the region are rectangular (it is not necessarily true that h = k).

The elements within the region are uniformly spaced (each has the same
size). L, H, h, and k do not need to be integers; they can be given in any convenient units.
To guarantee uniform spacing, of course, L and H must be integer multiples of h
and k, respectively. That is:

\[ L = (n+1)h, \qquad H = (m+1)k. \]
2. Internal Mesh Points 


Our goal is a system of equations, and ultimately a problem stated in terms
of a matrix and vectors. We will eventually see that there are mn equations in mn
unknowns, one for each internal mesh point (where the lines cross). Imagine elements
of size h × k (as before) that are centered on these points, such as the cross-hatched
element at (7,3). Each equation in the system will correspond to one of these
line-crossings and represent one of these elements. It is useful to label the lines for
reference purposes. To accomplish this, we use the (integer) counters i and j.

These counters are used to reference particular vertical and horizontal dividing
lines. The i counter refers to a vertical line (1 ≤ i ≤ n) and the horizontal
lines are indexed by j (1 ≤ j ≤ m). Figure D.2 may be deceptively simple due to
the element dimensions h = k = 1. Because of this, i = 7 indicates an x-coordinate
of 7 and j = 3 means y = 3. But the counters i and j are not generally equivalent to
x- and y-position in the coordinate system. Given h, k, i, and j, the corresponding
coordinates are (x, y) = (ih, jk).
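
The mesh conventions can be checked with a few lines of code. This is a minimal sketch, assuming the uniform-spacing relations L = (n+1)h and H = (m+1)k implied by the Figure D.2 example (L = 9 with n = 8 vertical lines and h = 1); the function names mesh_spacing and mesh_point are illustrative and do not come from the thesis codes.

```python
def mesh_spacing(L, H, m, n):
    """Return (h, k) for m horizontal and n vertical interior dividing lines."""
    h = L / (n + 1)  # element length in the x-direction
    k = H / (m + 1)  # element height in the y-direction
    return h, k


def mesh_point(i, j, h, k):
    """Coordinates of the crossing of vertical line i and horizontal line j."""
    return (i * h, j * k)


# The example of Figure D.2: L = 9, H = 5, m = 4, n = 8.
h, k = mesh_spacing(9.0, 5.0, 4, 8)
print(h, k)
print(mesh_point(7, 3, h, k))

# A mesh where the counters and the coordinates no longer coincide.
h2, k2 = mesh_spacing(1.0, 1.0, 4, 9)
print(mesh_point(7, 3, h2, k2))
```

In the second mesh the counters (7,3) refer to a point whose coordinates are fractions of the unit square, which is the distinction the text draws between (i, j) and (x, y).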
D. A SYSTEM OF EQUATIONS 


The next step is to build a system of mn equations that describes the problem.
First, we need to agree upon a referencing scheme for the internal mesh points. The
numbering will be based upon i and j as defined above. This numbering scheme
begins at the bottom left (i.e., i = j = 1), proceeds up the first column and then
moves, column-by-column, to the right. Specifically, the points will be assigned a
label

\[ \ell = m(i-1) + j. \tag{D.10} \]

Given the values i and j for any internal point, now we can assign it a label
(1 ≤ ℓ ≤ mn). Figure D.3 shows values of i along the x-axis, values of j
along the y-axis, and labeling of internal mesh points according to (D.10).
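
The labeling rule (D.10) and its inverse are easy to state in code. The sketch below is illustrative only; the function names label and unlabel are not taken from the thesis programs, and the inverse mapping is an addition that follows directly from (D.10).

```python
def label(i, j, m):
    """Label of internal mesh point (i, j) according to (D.10)."""
    return m * (i - 1) + j


def unlabel(ell, m):
    """Recover (i, j) from a label, inverting (D.10)."""
    column, row = divmod(ell - 1, m)
    return column + 1, row + 1


# With m = 4 horizontal lines and n = 8 vertical lines (Figure D.3):
m, n = 4, 8
print(label(1, 1, m))   # bottom-left point
print(label(7, 3, m))   # the cross-hatched element of Figure D.2
print(label(n, m, m))   # top-right point, the last label mn
print(unlabel(27, m))
```

Because the labels run up each column before moving right, neighbors to the North and South differ by one in label, while neighbors to the East and West differ by m.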
1. Finite Differences 


Figure D.3: Numbering the Equations

The approach calls for analyzing each internal mesh point. Figure D.4
shows the point referenced by i and j and its neighbors to the North, South, East,
and West. We use a centered finite difference method to approximate the partial
derivatives in (D.9) and arrive at the equations for these points. The finite difference
approximations for the partial derivatives are:


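\[ \left.\frac{\partial^2 u}{\partial x^2}\right|_{(i,j)} \approx \frac{u_{i-1,j} - 2u_{i,j} + u_{i+1,j}}{h^2} \tag{D.11} \]

\[ \left.\frac{\partial^2 u}{\partial y^2}\right|_{(i,j)} \approx \frac{u_{i,j-1} - 2u_{i,j} + u_{i,j+1}}{k^2} \tag{D.12} \]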

The approximation for the partial derivative in the x-direction (D.11) considers
the neighbor to the West, the point itself, and the neighbor to the East.
Similarly, the approximation in the y-direction (D.12) recognizes neighbors to the
South and North in addition to the point. Both finite difference approximations
favor the center point (i,j), giving it twice the weight of its neighbors.

Substituting these into Poisson’s equation (D.9) yields:

\[ -\left(\frac{u_{i-1,j} - 2u_{i,j} + u_{i+1,j}}{h^2}\right) - \left(\frac{u_{i,j-1} - 2u_{i,j} + u_{i,j+1}}{k^2}\right) = -f_{i,j} \tag{D.13} \]

Figure D.4: Neighbors to the North, South, East, and West


The forcing function, f_{i,j}, is known so (D.13) begins to look like one of many
equations in a linear system. There is such an equation for every internal mesh point.
To make sure that we consider all of the internal mesh points in an orderly fashion,
we may number them as in Figure D.3 and consider them one at a time.
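
The centered differences (D.11) and (D.12) can be exercised on a function whose Laplacian is known. The sketch below is an illustration rather than part of the thesis codes; the test function u(x,y) = x² + y² (for which Δu = 4) and the name discrete_laplacian are assumptions chosen for the example. Centered second differences are exact for quadratics, so the result does not depend on h or k.

```python
def discrete_laplacian(u, x, y, h, k):
    """Sum of the centered differences (D.11) and (D.12) at the point (x, y)."""
    west, center, east = u(x - h, y), u(x, y), u(x + h, y)
    south, north = u(x, y - k), u(x, y + k)
    d2x = (west - 2.0 * center + east) / h ** 2    # (D.11)
    d2y = (south - 2.0 * center + north) / k ** 2  # (D.12)
    return d2x + d2y


def u_quadratic(x, y):
    """Test function with Laplacian equal to 4 everywhere."""
    return x * x + y * y


# The cross-hatched point (7, 3) of Figure D.2, first with h = k = 1 ...
print(discrete_laplacian(u_quadratic, 7.0, 3.0, 1.0, 1.0))
# ... and then with unequal spacings.
print(discrete_laplacian(u_quadratic, 7.0, 3.0, 0.5, 0.25))
```

For functions that are not quadratic the two quotients only approximate the derivatives, with an error that shrinks as h and k are reduced.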


2. More Equations 


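At this point, we know the general form (D.13) for each of the equations
that must be considered. The matrix of coefficients may not be completely clear yet,
so let us consider each of the equations in the order of their labels. For now, we will
leave the i,j subscripts on everything:

\begin{align*}
-\left(\frac{u_{0,1} - 2u_{1,1} + u_{2,1}}{h^2}\right) - \left(\frac{u_{1,0} - 2u_{1,1} + u_{1,2}}{k^2}\right) &= -f_{1,1} \\
-\left(\frac{u_{0,2} - 2u_{1,2} + u_{2,2}}{h^2}\right) - \left(\frac{u_{1,1} - 2u_{1,2} + u_{1,3}}{k^2}\right) &= -f_{1,2} \\
&\;\;\vdots \\
-\left(\frac{u_{0,m-1} - 2u_{1,m-1} + u_{2,m-1}}{h^2}\right) - \left(\frac{u_{1,m-2} - 2u_{1,m-1} + u_{1,m}}{k^2}\right) &= -f_{1,m-1} \\
-\left(\frac{u_{0,m} - 2u_{1,m} + u_{2,m}}{h^2}\right) - \left(\frac{u_{1,m-1} - 2u_{1,m} + u_{1,m+1}}{k^2}\right) &= -f_{1,m} \\
-\left(\frac{u_{1,1} - 2u_{2,1} + u_{3,1}}{h^2}\right) - \left(\frac{u_{2,0} - 2u_{2,1} + u_{2,2}}{k^2}\right) &= -f_{2,1} \\
-\left(\frac{u_{1,2} - 2u_{2,2} + u_{3,2}}{h^2}\right) - \left(\frac{u_{2,1} - 2u_{2,2} + u_{2,3}}{k^2}\right) &= -f_{2,2} \\
&\;\;\vdots \\
-\left(\frac{u_{1,m-1} - 2u_{2,m-1} + u_{3,m-1}}{h^2}\right) - \left(\frac{u_{2,m-2} - 2u_{2,m-1} + u_{2,m}}{k^2}\right) &= -f_{2,m-1} \\
-\left(\frac{u_{1,m} - 2u_{2,m} + u_{3,m}}{h^2}\right) - \left(\frac{u_{2,m-1} - 2u_{2,m} + u_{2,m+1}}{k^2}\right) &= -f_{2,m} \\
&\;\;\vdots \\
-\left(\frac{u_{n-2,1} - 2u_{n-1,1} + u_{n,1}}{h^2}\right) - \left(\frac{u_{n-1,0} - 2u_{n-1,1} + u_{n-1,2}}{k^2}\right) &= -f_{n-1,1} \\
-\left(\frac{u_{n-2,2} - 2u_{n-1,2} + u_{n,2}}{h^2}\right) - \left(\frac{u_{n-1,1} - 2u_{n-1,2} + u_{n-1,3}}{k^2}\right) &= -f_{n-1,2} \\
&\;\;\vdots \\
-\left(\frac{u_{n-2,m-1} - 2u_{n-1,m-1} + u_{n,m-1}}{h^2}\right) - \left(\frac{u_{n-1,m-2} - 2u_{n-1,m-1} + u_{n-1,m}}{k^2}\right) &= -f_{n-1,m-1} \\
-\left(\frac{u_{n-2,m} - 2u_{n-1,m} + u_{n,m}}{h^2}\right) - \left(\frac{u_{n-1,m-1} - 2u_{n-1,m} + u_{n-1,m+1}}{k^2}\right) &= -f_{n-1,m} \\
-\left(\frac{u_{n-1,1} - 2u_{n,1} + u_{n+1,1}}{h^2}\right) - \left(\frac{u_{n,0} - 2u_{n,1} + u_{n,2}}{k^2}\right) &= -f_{n,1} \\
-\left(\frac{u_{n-1,2} - 2u_{n,2} + u_{n+1,2}}{h^2}\right) - \left(\frac{u_{n,1} - 2u_{n,2} + u_{n,3}}{k^2}\right) &= -f_{n,2} \\
&\;\;\vdots \\
-\left(\frac{u_{n-1,m-1} - 2u_{n,m-1} + u_{n+1,m-1}}{h^2}\right) - \left(\frac{u_{n,m-2} - 2u_{n,m-1} + u_{n,m}}{k^2}\right) &= -f_{n,m-1} \\
-\left(\frac{u_{n-1,m} - 2u_{n,m} + u_{n+1,m}}{h^2}\right) - \left(\frac{u_{n,m-1} - 2u_{n,m} + u_{n,m+1}}{k^2}\right) &= -f_{n,m}
\end{align*}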


3. Modification 


The goal is to determine u_{i,j} for all internal points (i,j). Having completed
several foundational steps, we can see a developing system of mn equations. Let’s
clean it up a bit. To do this, we need to make better use of one more piece of the
given information: the boundary values. For those points just inside the boundaries
(a horizontal distance of h from the sides and/or a vertical distance of k from the
top or bottom) we already know part of the left side of (D.13). In particular, any
subscript i = 0, j = 0, i = n+1, and/or j = m+1 signifies a (known) boundary
point.

Multiplying through by (hk)² and moving the known information to the
right-hand side of the equations, we again start with the left-most column (i = 1)
and work in the order of the labels. Now the system of equations looks like this:

Now the equations are very close to what we want. There are some unfortunate
side effects to such a deliberate approach. The list of equations is tedious,
the subscripts are a bit involved, and it takes some concentration to match things 
up. There are some benefits, though, for those who can endure! It will take very 


little effort to see how the coefficients are collected. 
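
As a sketch of the form the modified equations take (assuming the sign convention of (D.13), in which both difference quotients and f_{i,j} carry a minus sign, together with the multiplication by (hk)² described above), a generic interior point and the bottom-left point (1,1) give:

```latex
\begin{align*}
-k^2 u_{i-1,j} - h^2 u_{i,j-1} + 2(h^2 + k^2)\,u_{i,j} - h^2 u_{i,j+1} - k^2 u_{i+1,j}
  &= -(hk)^2 f_{i,j}, \\
2(h^2 + k^2)\,u_{1,1} - h^2 u_{1,2} - k^2 u_{2,1}
  &= -(hk)^2 f_{1,1} + k^2 u_{0,1} + h^2 u_{1,0}.
\end{align*}
```

In the second line the boundary values u_{0,1} (West) and u_{1,0} (South) are known, so they have been moved to the right-hand side; the same happens along every edge of the region.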
E. MATRIX REPRESENTATION 


It is not hard to translate the preceding equations into the familiar representation
Ax = b. Notation is quite important. We will start with the obvious, exchanging
u for x so that (eventually) the system will look like Au = b. Dimensions are important
too. The goal is a large, sparse, symmetric matrix A ∈ R^{mn × mn}. The vectors
u and b have the obvious dimensions and are assumed to contain real numbers as
well.


1. Unknowns 


Since there is a great deal of structure in this problem, it is useful to
partition the vector of unknowns, u. Let u_{i,j} have the same meaning as it did
in equation (D.13) and consider the m-vector:

\[ u_i = \begin{bmatrix} u_{i,1} \\ u_{i,2} \\ \vdots \\ u_{i,m-1} \\ u_{i,m} \end{bmatrix} \]

This vector captures all of the unknowns for a given column, i, of the original region.
Now we can stack the columns, n in number, forming the entire vector u of unknowns:

\[ u = \begin{bmatrix} u_1 \\ u_2 \\ \vdots \\ u_{n-1} \\ u_n \end{bmatrix} \]

This process has clearly formed u ∈ R^{mn}. Now we turn to the matrix of coefficients.


2. Coefficients 


The matrix A is formed by combining two smaller matrices, T and D. First
we shall consider the tridiagonal matrix T ∈ R^{m×m}. For aesthetic purposes only, let
the diagonal elements of T be d = 2(h² + k²).

\[ T = \begin{bmatrix} d & -h^2 & & & \\ -h^2 & d & -h^2 & & \\ & \ddots & \ddots & \ddots & \\ & & -h^2 & d & -h^2 \\ & & & -h^2 & d \end{bmatrix} \]

Next, consider the diagonal matrix D ∈ R^{m×m}:

Forming the matrix A requires n identical copies of T and 2(n - 1) identical
copies of D. The matrices in A below are assigned subscripts for counting purposes.
The matrix subscripts, by the way, denote a value of i corresponding to the partition
u_i which the matrix will multiply. A is the block-tridiagonal matrix

\[ A = \begin{bmatrix} T_1 & D_2 & & & \\ D_1 & T_2 & D_3 & & \\ & D_2 & T_3 & D_4 & \\ & & \ddots & \ddots & \ddots \\ & & & D_{n-1} & T_n \end{bmatrix} \]
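
The block structure can be assembled directly from the labeling rule (D.10). The following is a minimal sketch, not the thesis code: it stores A densely as a list of lists (practical only for small meshes), and it assumes that the diagonal block D equals -k² times the identity, which is what multiplying (D.13) by (hk)² gives for the East and West neighbors. The name build_matrix is illustrative.

```python
def build_matrix(m, n, h, k):
    """Dense mn-by-mn coefficient matrix with blocks T (diagonal) and D."""
    d = 2.0 * (h * h + k * k)        # diagonal entry of T
    size = m * n
    A = [[0.0] * size for _ in range(size)]
    for i in range(1, n + 1):        # vertical line (column of the mesh)
        for j in range(1, m + 1):    # horizontal line (row of the mesh)
            row = m * (i - 1) + j - 1        # zero-based version of (D.10)
            A[row][row] = d
            if j > 1:
                A[row][row - 1] = -h * h     # South neighbor, inside T
            if j < m:
                A[row][row + 1] = -h * h     # North neighbor, inside T
            if i > 1:
                A[row][row - m] = -k * k     # West neighbor, block D (assumed)
            if i < n:
                A[row][row + m] = -k * k     # East neighbor, block D (assumed)
    return A


# A tiny mesh (m = n = 2, h = k = 1) gives a 4-by-4 matrix.
for matrix_row in build_matrix(2, 2, 1.0, 1.0):
    print(matrix_row)
```

Points on the top of one column and the bottom of the next have consecutive labels but are not neighbors, which is why the off-diagonal entries of T stop at the block boundary.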


3. Knowns 


We could proceed immediately to the right-hand-side vector, b ∈ R^{mn}, using the
equations provided in the previous section. Again, though, the result can be cleaned
up a bit if we form b as the sum of three vectors f, v, w.

The vector f ∈ R^{mn} represents the forcing function. The equations clearly
indicate where the scalar multiplier comes from.

Next, the vector v ∈ R^{mn} is used to represent the information that is known
due to the boundary values on the East and West sides of the region.

\[ v = k^2 \begin{bmatrix} u_{0,1} \\ u_{0,2} \\ \vdots \\ u_{0,m-1} \\ u_{0,m} \\ 0 \\ \vdots \\ 0 \\ u_{n+1,1} \\ u_{n+1,2} \\ \vdots \\ u_{n+1,m-1} \\ u_{n+1,m} \end{bmatrix} \]
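Finally, the vector w ∈ R^{mn} is used to represent the information that is
known due to the boundary values on the North and South sides of the region.

\[ w = h^2 \begin{bmatrix} u_{1,0} \\ 0 \\ \vdots \\ 0 \\ u_{1,m+1} \\ u_{2,0} \\ 0 \\ \vdots \\ 0 \\ u_{2,m+1} \\ u_{3,0} \\ \vdots \\ u_{n,m+1} \end{bmatrix} \]

Now b is a simple sum of these vectors: b = f + v + w.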
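
The pieces can be checked against a problem with a known answer. The sketch below is illustrative and makes several assumptions that are not in the text: the forcing contribution is taken as -(hk)² f_{i,j}, the East/West boundary values are scaled by k² and the North/South values by h² (the scalings that follow from multiplying (D.13) by (hk)²), and the product Au is formed point by point instead of from a stored matrix. With u = x² + y² (so f = 4 and g = u on the boundary) the difference equations hold exactly, so Au - b should vanish up to rounding.

```python
def build_rhs(m, n, h, k, f, g):
    """Right-hand side b = f-part + v + w, ordered by the labels of (D.10)."""
    b = []
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            x, y = i * h, j * k
            value = -(h * k) ** 2 * f(x, y)           # forcing function
            if i == 1:
                value += k * k * g(0.0, y)            # West boundary (v)
            if i == n:
                value += k * k * g((n + 1) * h, y)    # East boundary (v)
            if j == 1:
                value += h * h * g(x, 0.0)            # South boundary (w)
            if j == m:
                value += h * h * g(x, (m + 1) * k)    # North boundary (w)
            b.append(value)
    return b


def apply_matrix(m, n, h, k, u):
    """Product A u using the five-point coefficients, without storing A."""
    result = []
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            p = m * (i - 1) + j - 1
            total = 2.0 * (h * h + k * k) * u[p]
            if j > 1:
                total -= h * h * u[p - 1]    # South
            if j < m:
                total -= h * h * u[p + 1]    # North
            if i > 1:
                total -= k * k * u[p - m]    # West
            if i < n:
                total -= k * k * u[p + m]    # East
            result.append(total)
    return result


def exact(x, y):
    """Known solution used for the check; its Laplacian is 4."""
    return x * x + y * y


m, n, h, k = 4, 8, 1.0, 1.0
b = build_rhs(m, n, h, k, lambda x, y: 4.0, exact)
u = [exact(i * h, j * k) for i in range(1, n + 1) for j in range(1, m + 1)]
residual = max(abs(p - q) for p, q in zip(apply_matrix(m, n, h, k, u), b))
print(residual)
```

A residual at the level of rounding error indicates that the coefficient pattern and the three right-hand-side contributions are consistent with one another.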


F. CONCLUSION 


This process has shown a few examples of partial differential equations that 
appear frequently in nature. Poisson’s equation in two dimensions was selected as 
an example. After the finite difference approximation is selected, determining the 
system of equations is a tedious (but not too complicated) process. Once the system 


of equations is written down, the matrix representation is easy to come by.

APPENDIX E
HYPERCUBE COMMUNICATIONS 


This report displays the results of point-to-point communications tests that
were performed on the Intel iPSC/2 hypercube. The emphasis of the experiment
was to evaluate several aspects of communications time. The exercise showed that
communication on this machine is virtually independent of the Hamming distance
between communicating nodes. There is clear evidence that transmission rates are
related to message length (the transmission system favors longer messages) due, at
least in part, to an overhead charged to begin the communication. Communications
between the host and a node never achieve the rate that can be realized with
node-to-node transmissions.

The communications test code described in this appendix was only executed on 
the iPSC/2. Time did not permit modification of the code and testing on the trans- 
puter networks. A thorough test of communications and computational abilities of 
the T414 and T800 transputers has already been performed by Gregory Bryant. His
master’s thesis [Ref. 26] contains the documentation of this work. A short summary


of Bryant’s findings is included in the conclusions to this appendix. 
A. SOURCE CODE OVERVIEW 


The host program (commtst.c) and a node program (commtstn.c) contain 
most of the code for this experiment. There is also a header file, commtst.h, shared
by these codes. Finally (but perhaps most important for any high-level survey of the


code), the makefile commtst.mak shows dependencies and compilation procedures. 


In the discussion that follows, bold-faced type is used to indicate function and object 


names that actually appear in the code. 
B. STRATEGY 


The program must define the valid arguments. The function interpret_args() 
takes care of checking for occurrences of these arguments in the command line. 
When the arguments have been interpreted, we know how to set variables like reps 
(repetitions), bytes (length of the message to be passed), and verbose (to control 
how much data is spewed out). Once these values are known, the host instructs each 
node to either RECEIVE or SEND. A special Tasking packet (structure) carries 
instructions to each node independently. Only one node is designated to SEND 
at any one time; the rest RECEIVE. Receivers simply crecv() the given number 
of bytes and return the message to the originator by calling csend(). Since this 
involves a round-trip, the issue of timing requires attention. 

We can divide the time measurement by two (to account for the round-trip), 
provided we aren't deceived by the outcome. That is, passing two b-byte messages is 
not the same as passing a single message of length 2b bytes. To make the timing data 
credible, however, the round-trip method is essential. The precision of the mclock() 
function is an additional issue. At best, mclock() is accurate to the millisecond (and 
ten milliseconds may be a more reasonable expectation). Very short messages can
produce questionable results in terms of the precision of the timing data. 

For this reason, tests of short messages should be repeated a number of times 
within the block surrounded by time checks. This, of course, revives the same issue 
(multiple repetitions of a message are not equivalent to a single, longer message). 
We may proceed, however, provided we establish a common understanding of the 
problem domain and terminology. I have used the term effective time to capture this
subtlety.

Wherever this term appears, it should be interpreted according to the following
definition:

\[ t_e = \frac{t}{2p} \]

where t_e is the effective time, t is the actual time measurement for the message, and p
is the number of repetitions. The factor of two is included to account for the
round-trip. For instance, suppose that the user asks for three repetitions of a message. The
implementation carries this out in a for loop. Time is sampled before and after the
loop. The inside of the loop is the simple csend() and crecv() sequence described
earlier. The effective time in this example would be t_e = t/6.
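
The definition is simple enough to state as code. This is an illustrative sketch in Python, not an excerpt from commtst.c: the function names are hypothetical, and the size of a “kbyte” (1,024 bytes here) is an assumption, so it is left as a parameter.

```python
def effective_time(t, p):
    """Effective one-way time t_e = t / (2p) for p timed round trips."""
    if p < 1:
        raise ValueError("the repetition count p must be at least 1")
    return t / (2 * p)


def rate_kbytes_per_sec(length_bytes, t_e_msec, bytes_per_kbyte=1024):
    """Transmission rate implied by an effective time given in milliseconds.

    bytes_per_kbyte is an assumption; the text does not say whether a
    kbyte means 1,000 or 1,024 bytes.
    """
    return (length_bytes / bytes_per_kbyte) / (t_e_msec / 1000.0)


# Three repetitions, as in the example above: t_e = t / 6.
print(effective_time(6.0, 3))

# One hundred repetitions timed at 837.40 msec (a host-to-node entry of
# Table E.2) give an effective time of about 4.19 msec.
print(round(effective_time(837.40, 100), 2))
```

The rate helper only converts units; the tabulated rates in this appendix come from the measurements themselves.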

In summary, there is no convenient (and credible) method for timing one-way
communications. If we time one-way communications, the results could be misleading
in that we could not be certain that the clock started just before the
beginning of the csend() and stopped immediately after the receiving node
accumulated the final byte of the message. We must also consider the issue of blocking
communication.¹ The round-trip method is not so easily misled by the fact
that csend() is not actually blocking. The transmission duties are quickly handed
over to a communication manager and processing continues directly. The crecv()
enforces blocking communications and execution stops at this function until the last
byte has been acquired. Thus the round-trip method seems to be quite reliable,
particularly in the case of node-to-node communications (if the host is involved, the
results are less consistent).

Since receiver nodes have nothing else to do but receive and retransmit the
message, the performance loss due to the round-trip method should be (almost
entirely) accounted for by two factors (loosely) placed into “software” and “hardware”
categories:

• Software overheads like establishing and freeing the activation stack for functions
(e.g., the csend() and crecv() functions).

• Hardware overheads associated with establishing the communication path and
performing switching. The take-down time for this task is probably negligible.

¹ By definition, blocking means that the invoking process (send or receive) causes execution of
the program to stop (be blocked from the CPU) until the communications requirement has been
satisfied.


Hence, if this method of analyzing communications performance errs, it does so on 
the conservative side. That is, the timing used in this method is liberal (if anything), 


so that communication rates will be estimated conservatively. 
C. RESULTS 


Considering the nature of the implementation, communications will be consid- 
ered bidirectional. In particular, the term “host-to-node” communications does not 
imply that the host is the originator of directed communication, but that a bidirec- 
tional exchange takes place between some node and the host. The host does send 
directed, one-way instructions to the nodes, but all timed communication originates 
at a node and returns to that node (even if it goes to the host). There are essentially
three groups of results, each of which captures data for node-to-node
communications and host-to-node communications.
1. Small Messages Repeated Ten Times 


The first test involved messages of length ℓ ≤ 1,024 bytes. Since the
shortest of these would not generate trustworthy timing data, the repetition count,
p, was set at ten. This gave t_e = t/20. Table E.1 shows the results.

Figure E.1: Speed of Small Host-Node Messages (Ten Repetitions)

a. Host-to-Node Performance


The communication rates for small host-node messages with a repeti- 
tion count of ten are illustrated in Figure E.1. Communications involving the host 
produce very irregular results (in the sense that the relationship between length and 
performance is not straightforward). The experiment was executed when only one 
user was logged in at the host and the results followed the same general pattern on 


repeated tests.

Figure E.2: Speed of Small Messages Between Nodes (Ten Repetitions)

b. Node-to-Node Performance


In the absence of contention for the communication medium, node- 
to-node communications within the cube are quite predictable. Figure E.2 shows 


transmission rates for small messages (up to one kilobyte) repeated ten times.

TABLE E.2: SHORT MESSAGES WITH ONE HUNDRED REPETITIONS

Node-to-Node
Message Length (Bytes) | t (msec) | t_e (msec) | Rate (kbytes/sec)
0.34 


2.80 
5.69 
11.37 
22 ol 
44.45 
87.17 
166.00 
181.69 
263.53 
340.60 
AES 
480.15 
543.48 
604.96 
662.54 
716.33 
766.87 
818.78 
863.44 
oO val 
948.41 
988.14 


818.30 

795.00 

774.50 

758.30 

137.10 

721.30 
1020.10 
1007.10
1004.50
1013.40 
1043.80 
1152.90 
1335.40 
1419.50 
1688.50 
1869.90 
1520.00 
1070.30 
1061.60 
1048.80 


Host-to-Node
t (msec) | t_e (msec) | Rate (kbytes/sec)
837.40 4.19 


0.23 
0.48 
0.98 
2.02 
4.12 
8.48 

] fess 
24.51 
37.24 
49.65 
62:22 
74.01 
83.83 
86.74 
84.24 
88.06 
81.43 
S022 
106.91 
163.51 
] (OrG2 
190.69 





2. Small Messages Repeated One Hundred Times 


For the next experiment, data was collected from runs using the same message
lengths, but the repetition count, p, was raised to one hundred. This gives
t_e = t/200, as shown in Table E.2.


a. Host-to-Node Performance


Figure E.3 gives the transmission rates corresponding to this data.

Figure E.3: Speed of Small Host-Node Messages (One Hundred Repetitions)

Figure E.4: Speed of Small Messages Between Nodes (One Hundred Repetitions)

b. Node-to-Node Performance


Figure E.4 shows the transmission rates for the node-to-node messages.
This data may have important implications. Consider the transmission of a matrix 
row-by-row within a loop (where one row is transmitted each time through the 
loop). The expected communications performance is related to the number of bytes 


in a single row of the matrix, not the size of the entire matrix. 
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3. Larger Messages 


The final test considered longer messages (1,024 < @ < 262,144) that were 
not repeated. This gives t, = t/2. Since the experiment was performed over a rather 
large set of message lengths, the data is divided at an arbitrary point. Messages 
of 64K bytes and less are designated “medium” length messages and placed into 
Table E.3. Messages of length 128h bytes and greater are designated “long” messages 
and placed into Table E.4. There is no hidden significance to this separation, it just 
made for tables of reasonable length. 

The figures that follow are based upon the combined data of both of these 
Tables. The host terminates execution at the crecv() if we ask for more than 262,144 
bytes in a single message. Chapter 2—iPSC/2 C Library Calls—of (Ref. 45: pp. 2- 
16, 2-19] explain: “messages to or from a host process are limited to a maximum 
of 256K bytes. There is no limit on message length between nodes.” This explains 


why the data stops at that message size. 
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Message 
Length 
(Bytes) 
1024 
2048 
3072 
4096 
5120 
6144 
7168 
8192 
9216 
10240 
11264 
12288 
13312 
14336 
15360 
16384 
17408 
18432 
19456 
20480 
21504 
22528 
23502 
24576 
25600 
26624 
27648 
28672 
29696 
30720 
31744 
32768 
65536 









TABLE E.3: MESSAGES OF MEDIUM LENGTH 
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t ip Rate 
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TABLE E.4: LONG MESSAGES 


Node-to-Node
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2677.82 
2682.48 
2684.79 
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Figure E.5: Speed of Large Host-Node Messages 
a. Host-to-Node Performance

The host-to-node communication rates (for large messages) are illustrated
in Figure E.5.
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Figure E.6: Speed of Large Messages Between Nodes 
b. Node-to-Node Performance 


Figure E.6 shows the transmission rates for the same long messages
when passed among nodes of the hypercube. To move the plot of Figure E.6 out
into the open, a plot of transmission rate versus log_10 ℓ is shown in Figure E.7.
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Figure E.7: Node-to-Node Transmission Rates for Large Messages 


D. CONCLUSIONS 


One of the obstacles that this experiment carefully avoided was competition 
for the links. Contention for communications resources may be inherent in certain 
parallel programs. Potential causes and effects of contention should always be given 
due consideration in the crafting of a parallel application. All of the algorithms that 
were tested in this research work involved very structured, regular communications 
schemes. An application with very random communication patterns should be ex- 
pected to behave very differently. Additionally, the communication scheme for every 
program in this work was designed to use the shortest possible path. 
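
The length of a shortest path in a hypercube can be sketched as below. This is an illustration only, assuming the usual binary node numbering in which neighboring nodes differ in exactly one bit; the function name hypercube_hops is hypothetical and is not part of the programs used in this work.

```c
/* Number of links on a shortest path between two hypercube nodes:
 * the count of bit positions in which their node numbers differ. */
int hypercube_hops(unsigned int a, unsigned int b)
{
    unsigned int diff = a ^ b;   /* set bits mark differing dimensions */
    int hops = 0;

    while (diff != 0) {
        hops += (int) (diff & 1u);
        diff >>= 1;
    }
    return hops;
}
```

In a dimension 3 cube, for example, nodes 0 and 7 are three links apart, while nodes 2 and 3 are direct neighbors.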

The circuit switching approach has the disadvantage that a single message must
control the entire path from origin to destination. Under a less controlled, random
pattern of communications, the communications subsystem might reasonably be
expected to exhibit degraded performance. Other portions of this thesis show that a
communication-bound algorithm can experience severe performance degradation as
well. There is no specific claim that the results obtained in this experiment
represent an upper bound for node-to-node communications within the hypercube,
but they are probably good estimates for an upper bound.

Host-node communication is slower than node-to-node communication. This
is not surprising (consider the physical distances and materials). In the absence of
competition for the links, node-to-node transmission rates are essentially predictable
for a given message length. There is a tremendous rise in transmission rate as message
length goes from one byte to the vicinity of twenty kilobytes. Thereafter, smaller
(apparently asymptotic) performance gains are achieved by increasing the message
size. A similar phenomenon occurs with host-node communications, but it takes
much longer messages to break, say, the two megabytes-per-second transmission
rate.
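
The shape of these curves is what one would expect from a simple linear cost model, in which each message pays a fixed start-up time plus a per-byte transfer time. The sketch below is an idealization offered for illustration; it is not fitted to the measurements in this appendix, and the function names and parameters (startup_ms, peak_bytes_per_ms) are hypothetical.

```c
#include <assert.h>

/* Effective rate (bytes per millisecond) for one message, assuming
 * time = startup_ms + len_bytes / peak_bytes_per_ms. */
double effective_rate(double len_bytes, double startup_ms,
                      double peak_bytes_per_ms)
{
    assert(len_bytes > 0.0 && startup_ms >= 0.0 && peak_bytes_per_ms > 0.0);
    return len_bytes / (startup_ms + len_bytes / peak_bytes_per_ms);
}

/* Message length at which the model reaches half of the peak rate.
 * At this length the start-up time equals the transfer time. */
double half_peak_length(double startup_ms, double peak_bytes_per_ms)
{
    return startup_ms * peak_bytes_per_ms;
}
```

Under this model the rate climbs steeply while the start-up time dominates and then approaches the peak rate asymptotically, which matches the qualitative behavior described above.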




These performance measures are quite appealing for long messages, but consider
transmissions of shorter (and possibly repetitious) messages. The data show
that short messages are penalized, even if they are part of a loop that involves a
good deal of communication. Each instance of csend() or crecv() is distinct and
incurs its own start-up cost. This is an important note for anyone considering
transmission of the rows (or columns) of a matrix within a loop structure. The
potential of (pre-transmission) storage of matrices (two-dimensional arrays) into
one-dimensional arrays might be investigated as a means of increasing the
communications rate (provided the cost of copying the array is not prohibitive).
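
The packing step suggested above might look like the following sketch. It assumes the matrix is stored as an array of row pointers holding doubles; the function name pack_rows is hypothetical, and no such routine is claimed to exist in the code used for these tests.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/* Copy an m-by-n matrix, stored as an array of row pointers, into one
 * contiguous buffer in row-major order.  Returns a buffer obtained from
 * malloc() (the caller frees it), or NULL if allocation fails. */
double *pack_rows(double **a, int m, int n)
{
    double *buf;
    int i;

    assert(a != NULL && m > 0 && n > 0);

    buf = (double *) malloc((size_t) m * (size_t) n * sizeof(double));
    if (buf == NULL) {
        return NULL;
    }
    for (i = 0; i < m; i++) {
        /* Row i lands at offset i * n in the packed buffer. */
        memcpy(buf + (size_t) i * (size_t) n, a[i],
               (size_t) n * sizeof(double));
    }
    return buf;
}
```

The packed buffer could then be handed to a single send of m * n * sizeof(double) bytes, paying one start-up cost instead of m. Whether this is a net gain depends on the cost of the copy relative to the start-up cost that is saved.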

Communication in a transputer network was not examined in this work, but
Bryant [Ref. 26] gives a very thorough analysis of communications and calculations
in a network of transputers. On pages 31-34, Bryant gives a good summary of
unidirectional and bidirectional data transfer rates. He discusses link interaction (i.e.,
how communications performance varies as one, two, or all four of the transputer’s
links are engaged in communication) on pages 34-38 and concludes that the effects
of link interaction are minimal.

Bryant also discusses the effects of varied communication loads on processor
performance. On pages 38-44, he finds that bombarding a transputer with many
small messages while it is trying to perform calculations can severely degrade the
processor’s performance. His Figures 3.8 and 3.9 show that, with only one link
active, messages of size 100 bytes and larger cause negligible performance
degradation. With all four links active, messages of size greater than one kilobyte
should be used to free the processor from most of the communications overhead.

Pages 36 and 37 of Bryant’s thesis show the effects of message length on the
communication rate. Bryant’s Figures 3.4 and 3.5 are quite similar to Figure E.6
above, but the transputers are much more responsive (i.e., there seems to be less
overhead involved, so the peak communications rate is achieved much earlier). In
fact, the transputers are near their peak transmission rate with messages of 100 bytes,
and messages of one kilobyte and greater always travel at peak rates.

Comparing a transputer system to an iPSC/2 system, in terms of communications
performance, is essentially a lesson in the differences between store-and-forward
switching and circuit switching for multi-hop communications. Bryant
shows [Ref. 26: pp. 83-85] that the store-and-forward transmission rates suffer as
the number of hops grows. The direct-connect (circuit switching) approach recovers
its overhead on multi-hop communications, but it ties up the entire path to do so
(making it unavailable to other potential users). The key difference is that
communications performance with the direct-connect method is very nearly
independent of the number of hops.
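
The contrast can be made concrete with two idealized timing models. These are textbook-style approximations offered only as a sketch; they are not taken from Bryant or from the measurements in this appendix, and every name and parameter below (startup_ms, per_hop_ms, bytes_per_ms) is hypothetical.

```c
#include <assert.h>

/* Store-and-forward: the whole message is received and re-sent at each
 * intermediate node, so every hop pays the full per-message cost. */
double store_and_forward_ms(int hops, double len_bytes,
                            double startup_ms, double bytes_per_ms)
{
    assert(hops >= 1 && bytes_per_ms > 0.0);
    return (double) hops * (startup_ms + len_bytes / bytes_per_ms);
}

/* Circuit switching: a path is established first (a small cost per hop),
 * and the message then crosses the whole path once. */
double circuit_switched_ms(int hops, double len_bytes, double startup_ms,
                           double per_hop_ms, double bytes_per_ms)
{
    assert(hops >= 1 && bytes_per_ms > 0.0);
    return startup_ms + (double) hops * per_hop_ms
           + len_bytes / bytes_per_ms;
}
```

In the first model the time grows in proportion to the number of hops. In the second, the hop count enters only through the path set-up term, so for long messages the total is nearly independent of distance.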

The transputer system seems to enforce true blocking communications on both
the sending and receiving ends (byte-by-byte acknowledgment is part of the
protocol). The iPSC/2 csend() is not blocking, but the crecv() function is blocking.
Proper handling of these issues can become important when implementing an
algorithm. Each method has advantages and disadvantages, but (at least for the
current systems) transputers seem better suited for applications involving short
messages over short distances and the iPSC/2 seems to handle long messages over
long distances better.
E. SOURCE CODE LISTINGS 


The source code listings for the programs used for these tests are supplied on
the pages that follow. The makefile commtst.mak appears first and describes the
dependencies among the files and compilation procedures. Next, commtst.h is the
header file associated with these programs. Finally, the actual code is given in a host
program called commtst.c and the node program commtstn.c.
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commtst.mak 


# Author:  Jonathan E. Hartman, U. S. Naval Postgraduate School
# Purpose: Makefile for Hypercube Communications Test Programs
# Date:    07 August 1991

all: hostcode nodecode

hostcode: commtst.o clargs.o
	cc clargs.o commtst.o -host -o commtst

clargs.o: clargs.h clargs.c
commtst.o: commtst.h commtst.c

nodecode: commtstn.o
	cc commtstn.o -node -o commtstn

commtstn.o: commtstn.c commtst.h

# Run it ---------------------------------------------------------------
run:
	commtst -d 3 -b 1024 -r 2

# Delete object files, executables ------------------------------------
clean:
	rm *.o
	rm commtst
	rm commtstn

# EOF commtst.mak -------------------------------------------------------
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commtst.h 


/* ------------------------ PROGRAM INFORMATION ------------------------
 *
 * SOURCE  : commtst.h
 * VERSION : 1.2
 * DATE    : 07 August 1991
 * AUTHOR  : Jonathan E. Hartman, U. S. Naval Postgraduate School
 *
 * ---------------------------- DESCRIPTION ----------------------------
 *
 * This header file gives common information for use across the host program
 * commtst.c and the node program commtstn.c. A more complete description
 * can be found in commtst.c.
 *
 * ---------------------------------------------------------------------- */


#ifndef EXIT_FAILURE
#define EXIT_FAILURE -1
#endif
#define MAX_CUBESIZE 16
#define ROOT         -1
#define RECEIVE      0
#define SEND         1
#define FALSE        0
#define TRUE         1
/* -------------------------- TYPE DEFINITION --------------------------
 *
 * The following structure is the framework that the root processor (host)
 * uses to pass instructions to the worker nodes in the cube.
 */
typedef struct {
   int  task;                       /* choose RECEIVE or SEND as above    */
   long bytes;                      /* length of message                  */
   long reps;                       /* number of repetitions              */
   int  destination[MAX_CUBESIZE];  /* for senders: identifies addressees */
} Tasking;
/* -------------------------- EOF commtst.h ---------------------------- */


commtst.c


/* ------------------------ PROGRAM INFORMATION ------------------------
 *
 * SOURCE     : commtst.c
 * VERSION    : 1.2
 * DATE       : 07 August 1991
 * AUTHOR     : Jonathan E. Hartman, U. S. Naval Postgraduate School
 *
 * USAGE      : commtst [-d dimension] [-b bytes] [-r repetitions] [-v]
 *
 * EXAMPLE    : If you type 'commtst -d 3 -v -b 1024 -r 10', it means to
 *              run the program on a dimension 3 hypercube in the verbose
 *              mode, with messages of length 1024 bytes, and 10
 *              repetitions for each message.
 *
 * REFERENCES : [1] iPSC/2 Programmer's Reference Manual
 *
 * ---------------------------- DESCRIPTION ----------------------------
 *
 * This program runs on the host. It orchestrates various point-to-point
 * communication tasks between nodes of a hypercube. The time of round-trip
 * communications is gathered and printed out. The output includes the time
 * required and rate of communication (taking into account repetitions and
 * round-trips). The 'verbose' mode gives a more detailed node-by-node
 * accounting of the run.
 *
 * ---------------------------------------------------------------------- */

char *version = "Hypercube Communications Test, Version 1.2";

/* ----------------------------- ALGORITHM -----------------------------
 *
 * The root (host) processor determines who will communicate with whom, and
 * when. No node operates independently. The host identifies a sender and
 * receiver(s). The host also gives the length of the message that should
 * be passed and the number of times that the message is to be repeated
 * (multiple repetitions may be required when the message is short since
 * mclock() returns milliseconds). The 'Tasking' structure holds instructions
 * from the manager (i.e., SEND or RECEIVE, the length of the message, number
 * of repetitions, and addressees). When this structure is received at
 * a node, it performs the task and awaits further instructions from the
 * manager processor. If the processor is a sender, it returns timing data
 * to the host upon completion.
 *
 * ---------------------------------------------------------------------- */
#include <stdio.h>
#include "commtst.h"
#include "ipsc.h"
#include "macros.h"
#include "clargs.h"

#define ASCII_CONVERSION 48   /* for char -> int conversion of 0...3 */
#define CT_SIZE          4    /* for cubetype[] size                 */
#define NUM_ARGS         4    /* -d -b -r -v                         */
#define DIM              0    /* index values into optv[]            */
#define BYTES            1
#define REPS             2
#define VERBOSE          3

/* ------------------------ FUNCTION DEFINITION ------------------------ */
#ifdef PROTOTYPE
void init(int argc, char **argv, char cubetype[CT_SIZE],
          int *dim, long *bytes, long *reps, int *verbose)
#else
void init(argc, argv, cubetype, dim, bytes, reps, verbose)
     int    argc;
     char **argv,
            cubetype[CT_SIZE];
     int   *dim;
     long  *bytes,
           *reps;
     int   *verbose;
#endif
{
   int count = 1,
       valid = FALSE;
   Opt_Struct *optv[NUM_ARGS];

   /* The first step is to make a table of all of the valid arguments. The
    * structure is defined more carefully in clargs.h, but the basic idea is
    * that we have an array of pointers to type Opt_Struct (option structure)
    * ...in this case, there are NUM_ARGS valid arguments and the next few
    * steps take care of allocation and definition of them. When this is
    * done, it is time to call interpret_args() to see what the user entered.
    */
   optv[DIM]     = (Opt_Struct *) calloc( 1, sizeof(Opt_Struct) );
   optv[BYTES]   = (Opt_Struct *) calloc( 1, sizeof(Opt_Struct) );
   optv[REPS]    = (Opt_Struct *) calloc( 1, sizeof(Opt_Struct) );
   optv[VERBOSE] = (Opt_Struct *) calloc( 1, sizeof(Opt_Struct) );

   optv[DIM]->lanswer   = (long *) calloc( 1, sizeof(long) );
   optv[BYTES]->lanswer = (long *) calloc( 1, sizeof(long) );
   optv[REPS]->lanswer  = (long *) calloc( 1, sizeof(long) );

   /* The intel compiler didn't like ...->argname = "-d"; etc. */

   optv[DIM]->argname[0]     = '-';
   optv[DIM]->argname[1]     = 'd';
   optv[DIM]->subargc        = 1;
   optv[DIM]->subargi        = NEXT_LONG;

   optv[BYTES]->argname[0]   = '-';
   optv[BYTES]->argname[1]   = 'b';
   optv[BYTES]->subargc      = 1;
   optv[BYTES]->subargi      = NEXT_LONG;

   optv[REPS]->argname[0]    = '-';
   optv[REPS]->argname[1]    = 'r';
   optv[REPS]->subargc       = 1;
   optv[REPS]->subargi       = NEXT_LONG;

   optv[VERBOSE]->argname[0] = '-';
   optv[VERBOSE]->argname[1] = 'v';
   optv[VERBOSE]->subargc    = 0;

   *dim = -1;

   interpret_args(argc, argv, NUM_ARGS, optv);



   if (optv[DIM]->found) *dim = (int) optv[DIM]->lanswer[0];

   switch (*dim) {
      case 0 : case 1 : case 2 : case 3 : break;
      default:
         while (!valid) {
            printf("Enter desired cube dimension (in {0, 1, 2, 3}): ");
            scanf("%d", dim);
            fflush(stdin);
            switch (*dim) {
               case 0 : case 1 : case 2 : case 3 : valid = TRUE; break;
            }
         }
   } /* end switch() */

   if (optv[BYTES]->found) *bytes = optv[BYTES]->lanswer[0];

   valid = FALSE;

   if (*bytes < 1) {
      while (!valid) {
         printf("Enter message length (bytes): ");
         scanf("%ld", bytes);
         fflush(stdin);
         if (*bytes > 0) { valid = TRUE; }
         else { printf("Message length must be positive.\n"); }
      }
   }

   if (optv[REPS]->found) { *reps = optv[REPS]->lanswer[0]; }
   else {
      printf("Non-existing (or invalid) repetition count, ");
      printf("using one repetition. \n\n");
      *reps = 1;
   }

   /* assign through the conditional's result; a conditional expression
    * is not an lvalue in standard C */
   *verbose = (optv[VERBOSE]->found) ? TRUE : FALSE;

   cubetype[0] = 'd';                 /* for dimension (to follow)     */
   cubetype[1] = (char)(*dim + ASCII_CONVERSION);
   cubetype[2] = 'f';                 /* means nodes are 386/387 combo */
   cubetype[3] = 0;

   printf("Initialization complete...Cube Dimension: %d\n", *dim);
   printf(" Message Length: %ld\n", *bytes);
   printf(" Repetitions: %ld\n\n", *reps);
   if (*verbose) printf(" Verbose Mode: ON");
}
/* ---------------------------------------------------------------------- */

#ifdef PROTOTYPE
main(int argc, char *argv[])
#else
main(argc, argv)
     int   argc;
     char *argv[];
#endif
{ /* begin main() */

   char *cubename = "Hypercube",
         cubetype[CT_SIZE],
        *msg,
        *nodecode = "commtstn";

   float avg,
         avg_hostrate,
         avg_hosttime,
         avg_rate,
         avg_time,
         bytes,
         reps;

   int cubesize,
       dim,
       i,
       j,
       verbose;

   unsigned long **timing_data;

   Tasking task_packet;

   printf("\n%s\n\n", version);

   init(argc, argv, cubetype, &dim, &(task_packet.bytes),
        &(task_packet.reps), &verbose);

   bytes = (float) task_packet.bytes;
   reps  = (float) task_packet.reps;
   bytes *= (2.0 * reps);    /* account for two-way communications, reps */

   cubesize = POW2(dim);

   timing_data = (unsigned long **) calloc(cubesize, sizeof(unsigned long*));

   for (i = 0; i < cubesize; i++) {
      timing_data[i] = (unsigned long*) calloc(cubesize, sizeof(unsigned long));
   }

   if (!(msg = (char *) calloc(task_packet.bytes, sizeof(char)))) {
      printf("main(): Allocation failure for msg.\n");
      exit(EXIT_FAILURE);
   }

   /* Get the cube and load the node code */

   getcube(cubename, cubetype, NULL, 0);
   attachcube(cubename);
   setpid(0);
   load(nodecode, ALL_NODES, NODE_PID);

   /*
    * Perform the tasking, receive the message, return it, receive and print
    * timing data...repeat for all players. The outer loop index, i, will
    * represent the sender node. The j index runs the other (RECEIVE)
    * players.
    */

   for (i = 0; i < cubesize; i++) {

      /* Get the receivers ready first */
      task_packet.task = RECEIVE;
      task_packet.destination[0] = i;
      task_packet.destination[1] = cubesize;    /* impossible flags end */

      for (j = 0; j < i; j++) {
         csend(0, &task_packet, sizeof(Tasking), j, NODE_PID);
      }

      for (j = (i+1); j < cubesize; j++) {
         csend(0, &task_packet, sizeof(Tasking), j, NODE_PID);
      }

      /* Then prepare the sender ==> he can start */
      task_packet.task = SEND;
      for (j = 0; j < i; j++) task_packet.destination[j] = j;
      task_packet.destination[i] = ROOT;
      for (j = (i+1); j < cubesize; j++) task_packet.destination[j] = j;
      csend(0, &task_packet, sizeof(Tasking), i, NODE_PID);

      /* Receive from the sender and return his message */
      for (j = 0; j < task_packet.reps; j++) {
         crecv(ANY_TYPE, msg, task_packet.bytes);
         csend(0, msg, task_packet.bytes, i, NODE_PID);
      }

      /* Receive the timing data from this run and print it */
      crecv(ANY_TYPE, timing_data[i], (cubesize * sizeof(unsigned long)) );

   } /* end for (i) */


for (i = 0; i < cubesize; i++) { 


if (verbose) { 


printf("Source Dest. 
printf ("====== === 
printf ("%4d HOST 7% 


printf(" %10.2f\n", (bytes / ((float) timing _data(liJ([i])) ); 


0.0: 


is) 
< 
0a 
iT 


for (j = 0; }, < cubesize-  j+ 


if (1 4= 9)°-¢ 


Time (msec) Rate (kilobytes/second)\n"); 


10lu ", i, timing _datali] (il); 


an: 


avg += (float) timing_data[i][j]; 


if (verbose) { 


printt(* 
printf("  4%10lu 


printf("%10.2f\n", (bytes / ((float) timing_data[iJ[jJ)) 


i; 
h; 
if (j == (cubesize - 1)) f{ 
avg /= (float) cubesize - 1; 
if (verbose) { 
printf (VSssssssssssssssessssssssssssssssssssssssssse 
printf ("==========\n"); 
printf ("Averages 
Printt( aes. 22 
printf(" kbytes/ 
} 
} 


} /* end for(j) */ 
} /* end for(i) */ 


44a", j); 
e timing datalid [3] }e 


OPE 49.1f msec ", avg); 
, bytes/avg ); 
sec\n\n\n") ; 


   for (i = 0; i < cubesize; i++) {
      for (j = 0; j < cubesize; j++) {
         /* diagonal entries hold the node <--> host times */
         if (i == j) avg_hosttime += timing_data[i][j];
         else        avg_time     += timing_data[i][j];
      }
   }

   avg_hosttime /= cubesize;
   avg_hostrate = bytes/avg_hosttime;

   avg_time /= ((cubesize - 1) * cubesize);
   avg_rate = bytes/avg_time;

   printf("If we average all of the times and rates....\n\n");
   printf(" Average Time: %9.1f milliseconds\n", avg_time);
   printf(" Average Rate: %10.2f kilobytes/second\n\n\n", avg_rate);

   printf("NOTE: Average and Rate values are for the nodes ONLY.\n");
   printf(" They do not include the host timing data.\n\n\n");

   printf("The averages for the node <--> host communications were: \n\n");
   printf(" Average Time: %9.1f milliseconds\n", avg_hosttime);
   printf(" Average Rate: %10.2f kilobytes/second\n\n\n", avg_hostrate);

} /* end main() */
/* -------------------------- EOF commtst.c ---------------------------- */


commtstn.c


/* ------------------------ PROGRAM INFORMATION ------------------------
 *
 * SOURCE  : commtstn.c
 * VERSION : 1.2
 * DATE    : 07 August 1991
 * AUTHOR  : Jonathan E. Hartman, U. S. Naval Postgraduate School
 *
 * ---------------------------- DESCRIPTION ----------------------------
 *
 * This program is loaded by commtst.c (which runs on the host). This code
 * (commtstn.c) runs on the nodes of a hypercube created by the host program.
 * For more information, see commtst.c.
 *
 */

#include <stdio.h>
#include "commtst.h"
#include "ipsc.h"

#define SUCCESS 0

#ifdef PROTOTYPE
main(int argc, char *argv[])
#else
main(argc, argv)
     int   argc;
     char *argv[];
#endif
{
   char *msg;

   int cubesize = numnodes(),
       i,
       j,
       return_addr;

   long rep;

   unsigned long start, *timing_data;

   Tasking task_packet;

   timing_data = (unsigned long*) calloc(cubesize, sizeof(unsigned long));

   for (i = 0; i < cubesize; i++) {

      crecv(ANY_TYPE, &task_packet, sizeof(Tasking));

      msg = (char *) calloc(task_packet.bytes, sizeof(char));

      switch (task_packet.task) {

         case RECEIVE :

            return_addr = task_packet.destination[0];

            /* echo each message back to the sender */
            for (rep = 0; rep < task_packet.reps; rep++) {
               crecv(ANY_TYPE, msg, task_packet.bytes);
               csend(0, msg, task_packet.bytes, return_addr, NODE_PID);
            }
            break;

         case SEND :

            j = 0;
            while ((j<cubesize)&&(task_packet.destination[j]<cubesize)) {

               start = mclock();

               /* the sender's own slot stands for the host */
               for (rep = 0; rep < task_packet.reps; rep++) {
                  (j == mynode()) ?
                     csend(0,msg,task_packet.bytes,myhost(),NODE_PID):
                     csend(0, msg, task_packet.bytes, j, NODE_PID);
                  crecv(ANY_TYPE, msg, task_packet.bytes);
               }

               timing_data[j] = mclock() - start;
               j++;
            }

            /* Return the timing data */
            csend(0, timing_data, (cubesize * sizeof(unsigned long)),
                  myhost(), NODE_PID);
            break;

         default :

            printf("Unrecognized task at node %ld.\n", mynode() );
            exit(EXIT_FAILURE);

      } /* end switch() */

      free(msg);

   } /* end for() */

   return(SUCCESS);
}
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APPENDIX F 
MATRIX LIBRARY 


This appendix contains part of the matrix library, matlib, that is often used
and referenced in other sections and code. It could be argued that “matrix library”
is a misnomer since much of the code has little to do with matrices. This criticism is
valid, but I will defend the name since the entire reason for creating such a library
was to handle matrices in a more reasonable way. The last section of this appendix
contains all of the source code for Gauss factorization with partial pivoting, and a
short excerpt from the complete pivoting code.

The specifications and a portion of the source code for the library are given on
the pages to follow. The original intent was to include the source code in its entirety,
but this would require more than double the current number of pages, so most of the
source has been omitted. The files are divided into four logical groups:


1. Makefiles that simplify maintenance of the library, show dependencies among
the files, and describe the compilation procedures that are used to generate the
loadable (executable) code.

2. Standard files (mostly C header files) that make definitions available (for
consistency) across a wide range of files. The range is implied by the content of
the file. These files include manifest constants that are installed using the C
Preprocessor #define directive, type definitions that are intended for use across
several files, and macro definitions that are expanded by the C Preprocessor.

3. Source code files that appear in pairs, like filename.h and filename.c or (mostly)
as a header file alone. The header file gives remarks, definitions of manifest
constants, type definitions, and function declarations (specifications) that pertain to
the associated source code (i.e., the code within filename.c). Again, the latter
has been omitted in most cases.

4. The Gauss factorization code. All of the source code for the partial pivoting
version is given, and an excerpt of the pivot election function from the complete
pivoting code is also provided.
A. MAKEFILES 


logc.mak  This makefile is a standard template for programs compiled with the
Logical Systems C (version 89.1) product.

matlib.mak  This makefile is used to translate matlib into a usable form. With
Logical Systems C, it creates a library suitable for installation and use as any
other normal C library. The portion of the makefile used on the Intel iPSC/2
simply works in the current directory to translate the source into object code so
that other programs can reference it.
logc.mak


# AUTHOR  : Jonathan E. Hartman, U. S. Naval Postgraduate School
# PURPOSE : Makefile for Hypercube Communications Test Programs (LogC)
# DATE    : 10 August 1991

ROOTCODE=filename
NODECODE=filename
NIF_FILE=filename

# --------------------- OPTIONS AND DEFINITIONS ----------------------
#
# The following section establishes various options and definitions. We
# start with PP, the Logical Systems C Preprocessor. The '-dX' option
# (with no macro_expression) is like '#define X 1'. Next we set up the
# compilation options for Logical Systems' TCX Transputer C Compiler. The
# '-c' means compress the output file. The options beginning with '-p'
# tell TCX to generate code for the appropriate processor:
#
#    -p2    T212 or T222
#    -p25   T225
#    -p4    T414
#    -p45   T400 or T425
#    -p8    T800
#    -p85   T801 or T805
#
# Logical Systems' TASM Transputer Assembler is next. The '-c' means
# compress the output file (it can cut it in half)! The '-t' is used
# because the input to TASM will be from a language translator (TCX's
# output) and not from assembly source code.
#
# The final list tells TLNK which libraries to look at during linking.
# It also establishes an entry point. You should always use _main for
# the root node; otherwise use _ns_main (for other nodes).
#
PPOPT2=-dPROTOTYPE -dTRANSPUTER -dT212
PPOPT4=-dPROTOTYPE -dTRANSPUTER -dT414
PPOPT8=-dPROTOTYPE -dTRANSPUTER -dT800
TCXOPT2=-cp2
TCXOPT4=-cp4
TCXOPT8=-cp8
TASMOPT=-ct
T2LIB=t2lib.tll
T4LIB=matlib4.tll t4cube.tll t4lib.tll
T8LIB=matlib8.tll t8cube.tll t8lib.tll
RENTRY=_main
NENTRY=_ns_main

# ----------------------- DEFAULT ===> MAKE ALL -----------------------

all: $(ROOTCODE).tld $(NODECODE).tld

$(ROOTCODE): $(ROOTCODE).tld

$(ROOTCODE).tld: $(ROOTCODE).trl
	echo FLAG c > $(ROOTCODE).lnk
	echo LIST $(ROOTCODE).map >> $(ROOTCODE).lnk
	echo INPUT $(ROOTCODE).trl >> $(ROOTCODE).lnk
	echo ENTRY $(RENTRY) >> $(ROOTCODE).lnk
	echo LIBRARY $(T4LIB) >> $(ROOTCODE).lnk
	tlnk $(ROOTCODE).lnk

$(ROOTCODE).trl: $(ROOTCODE).c
	pp $(ROOTCODE).c $(PPOPT4)
	tcx $(ROOTCODE).pp $(TCXOPT4)
	tasm $(ROOTCODE).tal $(TASMOPT)

$(NODECODE): $(NODECODE).tld

$(NODECODE).tld: $(NODECODE).trl
	echo FLAG c > $(NODECODE).lnk
	echo LIST $(NODECODE).map >> $(NODECODE).lnk
	echo INPUT $(NODECODE).trl >> $(NODECODE).lnk
	echo ENTRY $(NENTRY) >> $(NODECODE).lnk
	echo LIBRARY $(T8LIB) >> $(NODECODE).lnk
	tlnk $(NODECODE).lnk

$(NODECODE).trl: $(NODECODE).c
	pp $(NODECODE).c $(PPOPT8)
	tcx $(NODECODE).pp $(TCXOPT8)
	tasm $(NODECODE).tal $(TASMOPT)

# ----------------------------------------------------------------------
#

run: $(ROOTCODE).tld $(NODECODE).tld $(NIF_FILE).nif
	ld-net $(NIF_FILE)

# ------------------------------ CLEAN UP ------------------------------
#

clean:
	del $(ROOTCODE).lnk
	del $(NODECODE).lnk
	del $(ROOTCODE).map
	del $(NODECODE).map
	del $(ROOTCODE).tal
	del $(NODECODE).tal
	del $(ROOTCODE).pp
	del $(NODECODE).pp
	del $(ROOTCODE).trl
	del $(NODECODE).trl

# --------------------------- EOF logc.mak ----------------------------
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1 # ----------=essss==== MAKEFILE FOR MATRIX LIBRARY ==========---------- 
2 # 

3 # SOURCE =: matlib.mak 

4# DATE : 17 August 1991 

5 # AUTHOR : Jonathan E. Hartman, U. S. MWaval Postgraduate School 

6 # 

7 # PURPOSE : Make the matrix library ‘matlib’. 

8 F 

9 # REMARKS : This makefile works with Logical Systems C, version 89.1, 
10 # and the Intel iPSC/2 compiler. The LogC portions of this 
11 # makefile actually construct libraries of the functions available in the 
12 # source files indicated. There are two libraries generated--matlib4.tll 
13 # & matlib8.tll---since the code 1s compiled for T414 or T800 processors. 
14 # For the Intel compiler, I have not created a library; but have used the 
15 # object code as needed. There are a few sections that pertain to both 
16 # compilers. The sections that only pertain to a particular compiler are 
17 # clearly marked ‘Intel iPSC/2’ or ‘Logical Systems C’. 

18 # 

19 # wren BR KKK SS SSS SS SS SS SS SS SS SS SS SS SS SSeS Se Se ee ee eee ee ee ee Se eee ee ee ee eee Se wees 
20 

21 

22 

23 

24 

25 # ----------===s======= 1.) DEFINITIONS AND OPTIONS $==========---------- 
26 # 

27 # The following options and definitions are required. A more thorough 

28 # explanation can be found in ‘logc.mak’ or in the Logical Systems C 

29 # Transputer Toolset manual. 

30 # 

31 ¥ elie — Se ee ee 
32 

33 THISMAKEFILE=matlib.mak 

34 

35 

36 Bo emer mmm mmm mS SSS SSsssesss== 1.1) Intel iPSC/2 Ssssssssessss22-------- == 
37 # 

38 


39 # MATLIBDIR is the directory that contains the matlib files 
40 MATLIBDIR = /usr/hartman/matlib 


41 OBJECTS = clargs.o comm.o hcube.o generate.o mat_ops.o matrixio.o memory.o math.o 


sep.o timing.o vec_ops.o 

42 

43 

44 

45 

# ----------============ 1.2) Logical Systems C ============----------
#

T414LIBNAME=matlib4
T800LIBNAME=matlib8

TRL4FILES=clargs.trl4 comm.trl4 complex.trl4 generate.trl4 machine.trl4 mat_ops.trl4 math.trl4 matrixio.trl4 memory.trl4 num_sys.trl4 sep.trl4 timing.trl4 vec_ops.trl4
TRL8FILES=clargs.trl8 comm.trl8 complex.trl8 generate.trl8 machine.trl8 mat_ops.trl8 math.trl8 matrixio.trl8 memory.trl8 num_sys.trl8 sep.trl8 timing.trl8 vec_ops.trl8

TLIB4FILES=clargs comm complex generate machine mat_ops math matrixio memory num_sys sep timing vec_ops
TLIB8FILES=clargs comm complex generate machine mat_ops math matrixio memory num_sys sep timing vec_ops

PPOPT2=-dPROTOTYPE -dTRANSPUTER -dT212
PPOPT4=-dPROTOTYPE -dTRANSPUTER -dT414
PPOPT8=-dPROTOTYPE -dTRANSPUTER -dT800

TCXOPT2=-cp2
TCXOPT4=-cp4
TCXOPT8=-cp8

TASMOPT=-ct

T2LIB=t2lib.tll
T4LIB=matlib4.tll t4cube.tll t4lib.tll
T8LIB=matlib8.tll t8cube.tll t8lib.tll

RENTRY=_main
NENTRY=_ns_main


# ----------======= 2.) INSTRUCTIONS FOR DEFAULT MAKE =======----------
#
# The following sections give the default (since they appear first in the
# makefile) options for this makefile.  By commenting one or the other
# out, one can get to the defaults easily.
#
# ----------------------------------------------------------------------

ipsc: imatlib
clean: iclean
# tptr: tmatlib
# clean: tclean

imatlib: $(OBJECTS)

# ----------============ 2.2) Logical Systems C ============----------
#
# Make everything and install in the library directory designated by the
# environment variable TLIB.

tmatlib:
	make -f $(THISMAKEFILE) $(T414LIBNAME).tll
	make -f $(THISMAKEFILE) install4
	make -f $(THISMAKEFILE) tclean
	make -f $(THISMAKEFILE) $(T800LIBNAME).tll
	make -f $(THISMAKEFILE) install8
	make -f $(THISMAKEFILE) tclean
	make -f $(THISMAKEFILE) install_headers

# ---------- CREATE T414 VERSION OF THE LIBRARY ----------

$(T414LIBNAME).tll : $(TRL4FILES)
	tlib $(T414LIBNAME) -b $(TLIB4FILES)

clargs.trl4 : clargs.h clargs.c
	pp clargs.c $(PPOPT4)
	tcx clargs.pp $(TCXOPT4)
	tasm clargs.tal $(TASMOPT)

comm.trl4 : comm.h comm.c
	pp comm.c $(PPOPT4)
	tcx comm.pp $(TCXOPT4)
	tasm comm.tal $(TASMOPT)

complex.trl4 : complex.h complex.c
	pp complex.c $(PPOPT4)
	tcx complex.pp $(TCXOPT4)
	tasm complex.tal $(TASMOPT)

generate.trl4 : generate.h generate.c matrix.h memory.trl4
	pp generate.c $(PPOPT4)
	tcx generate.pp $(TCXOPT4)
	tasm generate.tal $(TASMOPT)

hcube.trl4 : hcube.h hcube.c
	pp hcube.c $(PPOPT4)
	tcx hcube.pp $(TCXOPT4)
	tasm hcube.tal $(TASMOPT)

machine.trl4 : machine.h machine.c
	pp machine.c $(PPOPT4)
	tcx machine.pp $(TCXOPT4)
	tasm machine.tal $(TASMOPT)

mat_ops.trl4 : mat_ops.h mat_ops.c matrix.h
	pp mat_ops.c $(PPOPT4)
	tcx mat_ops.pp $(TCXOPT4)
	tasm mat_ops.tal $(TASMOPT)

math.trl4 : math.h math.c
	pp math.c $(PPOPT4)
	tcx math.pp $(TCXOPT4)
	tasm math.tal $(TASMOPT)

matrixio.trl4 : matrixio.h matrixio.c ascii.h matrix.h memory.trl4
	pp matrixio.c $(PPOPT4)
	tcx matrixio.pp $(TCXOPT4)
	tasm matrixio.tal $(TASMOPT)

memory.trl4 : memory.h memory.c matrix.h
	pp memory.c $(PPOPT4)
	tcx memory.pp $(TCXOPT4)
	tasm memory.tal $(TASMOPT)

num_sys.trl4 : num_sys.h num_sys.c matrix.h
	pp num_sys.c $(PPOPT4)
	tcx num_sys.pp $(TCXOPT4)
	tasm num_sys.tal $(TASMOPT)

sep.trl4 : sep.h sep.c
	pp sep.c $(PPOPT4)
	tcx sep.pp $(TCXOPT4)
	tasm sep.tal $(TASMOPT)

timing.trl4 : timing.h timing.c
	pp timing.c $(PPOPT4)
	tcx timing.pp $(TCXOPT4)
	tasm timing.tal $(TASMOPT)

vec_ops.trl4 : vec_ops.h vec_ops.c
	pp vec_ops.c $(PPOPT4)
	tcx vec_ops.pp $(TCXOPT4)
	tasm vec_ops.tal $(TASMOPT)

# ---------- CREATE T800 VERSION OF THE LIBRARY ----------

$(T800LIBNAME).tll : $(TRL8FILES)
	tlib $(T800LIBNAME) -b $(TLIB8FILES)

clargs.trl8 : clargs.h clargs.c
	pp clargs.c $(PPOPT8)
	tcx clargs.pp $(TCXOPT8)
	tasm clargs.tal $(TASMOPT)

comm.trl8 : comm.h comm.c
	pp comm.c $(PPOPT8)
	tcx comm.pp $(TCXOPT8)
	tasm comm.tal $(TASMOPT)

complex.trl8 : complex.h complex.c
	pp complex.c $(PPOPT8)
	tcx complex.pp $(TCXOPT8)
	tasm complex.tal $(TASMOPT)

generate.trl8 : generate.h generate.c matrix.h memory.trl8
	pp generate.c $(PPOPT8)
	tcx generate.pp $(TCXOPT8)
	tasm generate.tal $(TASMOPT)

hcube.trl8 : hcube.h hcube.c
	pp hcube.c $(PPOPT8)
	tcx hcube.pp $(TCXOPT8)
	tasm hcube.tal $(TASMOPT)

machine.trl8 : machine.h machine.c
	pp machine.c $(PPOPT8)
	tcx machine.pp $(TCXOPT8)
	tasm machine.tal $(TASMOPT)

mat_ops.trl8 : mat_ops.h mat_ops.c matrix.h
	pp mat_ops.c $(PPOPT8)
	tcx mat_ops.pp $(TCXOPT8)
	tasm mat_ops.tal $(TASMOPT)

math.trl8 : math.h math.c
	pp math.c $(PPOPT8)
	tcx math.pp $(TCXOPT8)
	tasm math.tal $(TASMOPT)

matrixio.trl8 : matrixio.h matrixio.c ascii.h matrix.h memory.trl8
	pp matrixio.c $(PPOPT8)
	tcx matrixio.pp $(TCXOPT8)
	tasm matrixio.tal $(TASMOPT)

memory.trl8 : memory.h memory.c matrix.h
	pp memory.c $(PPOPT8)
	tcx memory.pp $(TCXOPT8)
	tasm memory.tal $(TASMOPT)

num_sys.trl8 : num_sys.h num_sys.c matrix.h
	pp num_sys.c $(PPOPT8)
	tcx num_sys.pp $(TCXOPT8)
	tasm num_sys.tal $(TASMOPT)

sep.trl8 : sep.h sep.c
	pp sep.c $(PPOPT8)
	tcx sep.pp $(TCXOPT8)
	tasm sep.tal $(TASMOPT)

timing.trl8 : timing.h timing.c
	pp timing.c $(PPOPT8)
	tcx timing.pp $(TCXOPT8)
	tasm timing.tal $(TASMOPT)

vec_ops.trl8 : vec_ops.c vec_ops.h
	pp vec_ops.c $(PPOPT8)
	tcx vec_ops.pp $(TCXOPT8)
	tasm vec_ops.tal $(TASMOPT)


# ---------- COPY LIBRARIES TO LIBRARY DIRECTORY ----------

install4:
	copy $(T414LIBNAME).tll $(TLIB)

install8:
	copy $(T800LIBNAME).tll $(TLIB)

# ---------- COPY HEADER FILES TO STANDARD INCLUDE DIRECTORY ----------

install_headers:
	copy ascii.h $(TLIB)\..\include
	copy macros.h $(TLIB)\..\include
	copy matrix.h $(TLIB)\..\include
	copy clargs.h $(TLIB)\..\include
	copy comm.h $(TLIB)\..\include
	copy complex.h $(TLIB)\..\include
	copy generate.h $(TLIB)\..\include
	copy hcube.h $(TLIB)\..\include
	copy machine.h $(TLIB)\..\include
	copy mat_ops.h $(TLIB)\..\include
	copy math.h $(TLIB)\..\include
	copy matrixio.h $(TLIB)\..\include
	copy memory.h $(TLIB)\..\include
	copy num_sys.h $(TLIB)\..\include
	copy sep.h $(TLIB)\..\include
	copy timing.h $(TLIB)\..\include
	copy vec_ops.h $(TLIB)\..\include


# ----------======== 3.) FILE MANAGEMENT & UTILITIES ========----------
#
# This section makes short work of a few useful/routine tasks.
#
# ----------------------------------------------------------------------

# ----------============== 3.1) Intel iPSC/2 ==============----------
#

iclean:
	rm $(OBJECTS)

# ----------============ 3.2) Logical Systems C ============----------
#

tclean:
	del *.pp
	del *.tal
	del *.trl

# EOF matlib.mak --------------------------------------------------------
B. NETWORK INFORMATION FILES


hyprcube.nif This Network Information File gives a fairly complete description of the hardware configuration used to perform the transputer work.


hyprcube.nif


; -------------======== NETWORK INFORMATION FILE ========-------------
;
; SOURCE  : hyprcube.nif
; VERSION : ad
; DATE    : 09 September 1991
; AUTHOR  : Jonathan E. Hartman, U. S. Naval Postgraduate School
; USAGE   : ld-net hyprcube
; EDITING : replace 'rootcode' with the code to run on the root
;           replace 'nodecode' with appropriate code(s) for the nodes
;
; -------------=============== REFERENCES ===============-------------
;
; [1] Inmos. IMS B012 User Guide and Reference Manual. Inmos Limited,
;     1988, Fig. 26, p. 28.
;
; -------------=============== DESCRIPTION ==============-------------
;
; Network Information File (NIF) used by Logical Systems C (version 89.1)
; LD-NET Network Loader.  This file prescribes the loading action to take
; place when the 'ld-net' command is given as in USAGE above.
;

; -------------========= HARDWARE PREREQUISITES =========-------------
;
; NOTE: There are three node numbering systems: the one created by Inmos'
; CHECK program, the Gray code labeling, and the NIF labeling.  Since all
; three will be used on occasion, I will prefix node numbers with a C, G,
; or N to identify which system I am using!
;
; The IMS B004 and IMS B012 must be configured correctly.  The B004's T414
; has link 0 connected to the host PC via a serial-to-parallel converter,
; link 1 connected to the IMS B012 PipeHead, link 2 connected to the T212
; [communications manager (not used here)] on the B012, and link 3
; connected to the IMS B012 PipeTail (see [1]).  By the way, link 2 from
; the B004 goes to the ConfigUp slot just under the PipeHead slot
; (this connects it to the T212).  Finally, the B004's Down link must run
; to the B012's Up link.
;


; -------------==== SETTING THE C004 CROSSBAR SWITCHES ====-------------
;
; Once you have connected the hardware in the fashion mentioned above,
; the system is ready to be transformed to a hypercube.  Three codes by
; Mike Esposito are used here: t2.nif, root.tld, and switch.tld.  I have
; a batch file called 'makecube.bat' that performs a 'ld-net t2' also.
;
; Mike's code passes instructions to the T212 on the B012, which, in turn,
; tells the C004's how to connect their switches.  After the code has
; executed, the (very specific) configuration that we are looking for
; will exist.  Specifically, the following (output from CHECK /R) is what
; this process gives us:

;
; check 1.21
;  # Part rate  Mb   Bt [  Link0  Link1  Link2  Link3 ]
;  0 T414b-15   0.09  0 [  HOST   1:1    2:1    3:2   ]
;  1 T800c-20   0.80  1 [  4:3    0:1    5:1    6:0   ]
;  2 T2   -17   0.49  1 [  C004   0:2    ...    C004  ]
;  3 T800c-20   0.80  2 [  7:3    8:2    0:3    9:0   ]
;  4 T800c-20   0.76  3 [  9:3    10:2   11:1   1:0   ]
;  5 T800d-20   0.90  1 [  8:3    1:2    10:1   12:0  ]
;  6 T800d-20   0.76  0 [  1:3    12:2   7:1    11:0  ]
;  7 T800d-20   0.76  3 [  13:3   6:2    14:1   3:0   ]
;  8 T800d-20   0.90  2 [  14:3   15:2   3:1    5:0   ]
;  9 T800c-20   0.77  0 [  3:3    13:2   15:1   4:0   ]
; 10 T800d-20   0.90  2 [  16:3   5:2    4:1    15:0  ]
; 11 T800d-20   0.90  1 [  6:3    4:2    16:1   13:0  ]
; 12 T800d-20   0.77  0 [  5:3    16:2   6:1    14:0  ]
; 13 T800d-20   0.77  3 [  11:3   17:2   9:1    7:0   ]
; 14 T800c-20   0.90  1 [  12:3   7:2    17:1   8:0   ]
; 15 T800c-20   0.90  2 [  10:3   9:2    8:1    17:0  ]
; 16 T800c-20   0.76  3 [  17:3   11:2   12:1   10:0  ]
; 17 T800d-20   0.88  2 [  15:3   14:2   13:1   16:0  ]
;
; Here node C0 is the root transputer (on the IMS B004) and node C2 is
; the T212 (on the IMS B012).  The other sixteen nodes are the T800's
; that are used for the work.  A logical interconnection topology is
; described below.

;
; -------------================ TOPOLOGY ================-------------
;
; The physical interconnection scheme described above is an actual 4-cube
; with one exception.  The root node (C0) is situated BETWEEN nodes C1
; and C3 (which would be connected directly in the usual 4-cube).  This
; gives us two 3-cubes: one whose node labeling is G0xxx and the other,
; whose node labeling is G1xxx (where the xxx represents all permutations
; of 3-bits).  These are the usual 3-cubes, and they will exist if we
; define the node numbering/labeling correctly.

;
; -------------================ STRATEGY ================-------------
;
; The node labeling established by the NIF is available via the variable
; _node_number (see <conc.h>) in source code.  Therefore, we would like a
; smart labeling scheme in the NIF file so that programming is easier.
; This, of course, is subject to the restriction that NIF labels begin
; with N1 and so on.
;
; One such method would be to define a NIF labeling so that the Gray code
; label for a node would be (_node_number - 2).  In fact, this is
; possible and the adjacencies defined below allow us to realize this
; feature.  Below, node N0 is the host PC, node N1 is the root transputer
; (T414 on the B004), N2 through N17 correspond to G0 through G15 (the
; nodes of a 4-cube), and N18 is not used (but it's the T212).

JO7os 

108 ' eee ee ewe KH em em eM eM SSS SS SS SSS SS SS Sl SSS eee eee ee ee eee ee ee ee eee ee ee ee ewww ew em em ew ew ew eK | 
109 

110 

111 host_server cio.exe; (default) 

112 

ise: TRANSPUTER RESET DESCRIPTIOW OF LIWK CONNECTIONS 

114 ; WODE LOADABLE COS a 
Pie: ID CODES@ tid) FROM: LINKO LINK1 LIWK2 LINK3 

116 ; a SS SS SSS SS SS Sa = o=o--—— -——— —— to ”-_——<——<—— See = 

13% 1. rootcode, x0; 0, 22 Fi 10; BOO4 
118 Ze nodecode, rig 4, 1, 3S; 6; B01i2 
119 3. nodecode, TZ: ii, Pie 5. tee 

120 4, nodecode, V5, 12: IS. 8, 2: 

121 5, nodecode, 735 9, She 4, 13% 

122 6, nodecode, ri 2 1; 14, 8; 

123 t nodecode, r9, 3. 9, 6, 1S; 

124 8, nodecode, r4, 6, 4, 9, 16; 

125 9, nodecode, ro. 17; 8, ya §; 

126 10, nodecode, rm, 14, ph i 12; 

127 11, nodecode, ris. LS). 13s 10, 33 

128 12: nodecode, r16, 10, 16, 13% 4; 

129 13; nodecode, ga Was 5, i; 1 We | a 

130 14, nodecode, ré6é, 16, 6, 1S. 10; 

131 NUS. nodecode, ri4, a 14, 1%. pe 

132 16, nodecode, 5 ap ae 8. Air 125 14; 

133 17, nodecode, mS, 13) 15; 16, 9; 

134 ; 18, switch, si, ; 1 ; ; T2i2 
135 

136 

137 5 totter rere ne SSSSsssssee EOF hyprcube.nif sSsssssSSSSS5------------- 
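
The labeling strategy described in the NIF comments can be sketched in C. This is only an illustration of the arithmetic, not code from the library: the helper names `gray_label`, `nif_neighbor`, and `link_passes_through_root` are hypothetical. The sketch assumes what the comments state, namely that the Gray code label equals `_node_number - 2` for N2 through N17, and that two nodes of a hypercube are adjacent when their labels differ in exactly one bit. The special treatment of the G0-G8 edge follows the comm.h description of the hybrid topology, in which the root sits between nodes zero and eight.

```c
#include <assert.h>

#define NIF_OFFSET 2   /* from the NIF comments: Gray label = _node_number - 2 */

/* Hypothetical helper: Gray code label (G0..G15) for NIF node N2..N17. */
int gray_label(int nif_node_number)
{
    assert(nif_node_number >= 2 && nif_node_number <= 17);
    return nif_node_number - NIF_OFFSET;
}

/* Hypothetical helper: NIF node number of the neighbor across dimension
 * 'dim' (0..3).  Neighbors in a hypercube differ in exactly one bit. */
int nif_neighbor(int nif_node_number, int dim)
{
    int g = gray_label(nif_node_number);

    assert(dim >= 0 && dim <= 3);
    return (g ^ (1 << dim)) + NIF_OFFSET;
}

/* In the hybrid topology the G0-G8 edge is not a direct link; the root
 * transputer sits between those two nodes.  Returns 1 for that edge. */
int link_passes_through_root(int nif_a, int nif_b)
{
    int a = gray_label(nif_a);
    int b = gray_label(nif_b);

    return (a == 0 && b == 8) || (a == 8 && b == 0);
}
```

With this labeling, node N2 (G0) has neighbors N3, N4, and N6 by direct links, and reaches N10 (G8) only through the root.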
C. STANDARD FILES


macros.h This header file gives several C macros that are used in other programs.


matrix.h This header file establishes the standard definition of a matrix.


macros.h


/* -------------========== PROGRAM INFORMATION ==========-------------
 *
 * SOURCE  : macros.h
 * VERSION : 123
 * DATE    : 14 September 1991
 * AUTHOR  : Jonathan E. Hartman, U. S. Naval Postgraduate School
 *
 * -------------------------------------------------------------------
 */

#define MAX(x,y) (((x) > (y)) ? (x) : (y))
#define MIN(x,y) (((x) > (y)) ? (y) : (x))
#define POW2(n) ((1) << (n))

/* -------------============== EOF macros.h =============------------- */
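
A short usage sketch for these macros follows. It assumes the conventional ternary definitions of MAX and MIN (repeated here so that the fragment is self-contained); the helper functions `cube_size` and `clamp` are hypothetical and exist only for illustration. Because the macros evaluate their arguments more than once, arguments with side effects such as `i++` should be avoided.

```c
#define MAX(x,y) (((x) > (y)) ? (x) : (y))
#define MIN(x,y) (((x) > (y)) ? (y) : (x))
#define POW2(n)  ((1) << (n))

/* Hypothetical helper: number of nodes in a hypercube of the given
 * dimension, e.g. 16 nodes for a 4-cube. */
int cube_size(int dimension)
{
    return POW2(dimension);
}

/* Hypothetical helper: restrict v to the closed interval [lo, hi]. */
int clamp(int v, int lo, int hi)
{
    return MAX(lo, MIN(v, hi));
}
```

For example, `cube_size(3)` gives the eight nodes of a 3-cube, and `clamp(node, 0, cube_size(4) - 1)` keeps a node label inside the range of a 4-cube.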
matrix.h


/* -------------========== PROGRAM INFORMATION ==========-------------
 *
 * SOURCE  : matrix.h
 * VERSION : 2.0
 * DATE    : 02 September 1991
 * AUTHOR  : Jonathan E. Hartman, U. S. Naval Postgraduate School
 *
 * -------------============== DESCRIPTION ==============-------------
 *
 * A header file for a family of functions designed to work with matrices.
 *
 * -------------------------------------------------------------------
 */


#define 
#define 
#ifndef 
#define 
#endif 

#ifndef 
#define 
#endif 

#define 
#define 
#define 
#define 
#define 
#define 
#define 
#define 
#define 
#define 
#define 
#define 
#define 
#define 
#define 
#define 
#define 


BASE_TEN 
CURRENT 


EXIT_FAILURE 
EXIT_FAILURE 


Bry SsUCCESS 
EAL A OUCCESS 


FAILURE 
FALSE 
LINE_LENGTH 


MAX_NAME_LENGTH 


KO 

OFF 

ON 
OWE_BYTE 
ONE_MEMBER 
PREVIOUS 
SUCCESS 
TRUE 
HYPE-CHAR 
TYPE_DOUBLE 
TYPE_FLOAT 
TYPE_INT 
WES 


/* for Complex_Type */ 


10 


80 


RBPWNRrF OF OO OF KF FY OO O 


tO 
to 
| 


/* -------------=========== TYPE DEFINITIONS ===========------------- */

typedef struct {

    char *name;
    int rows,
        cols;
    double **matrix;

} Matrix_Type; /* default/standard is type double */


typedef struct {

    char *name;
    int rows,
        cols;
    Complex_Type **matrix;

} Complex_Matrix_Type; /* type Complex_Type is in complex.h */


typedef struct {

    char *name;
    int rows,
        cols;
    double **matrix;

} Double_Matrix_Type;


typedef struct {

    char *name;
    int rows,
        cols;
    float **matrix;

} Float_Matrix_Type;


typedef struct {

    char *name;
    int rows,
        cols;
    int **matrix;

} Int_Matrix_Type;


/* -------------============== EOF matrix.h =============------------- */



D. SOURCE CODE FILES


There is one header file and one (.c) source code file for each remaining member of the library, so the filename is given without the suffix.


allocate Memory allocation and management functions.

clargs For processing command-line arguments.

comm Communications functions for the hypercubes.

complex Complex numbers and operations.

epsilon Machine precision functions.

generate Matrix generation functions.

io Input/output (I/O) functions.

mathx A small extension to the C math library.

num_sys Various number systems (binary, decimal, hexadecimal).

ops Matrix and vector operations.

timing Functions for timing.


Again, however, most of the source code has been omitted and only the header files remain. The single exception is complex.c, because this source contains an algorithm referenced earlier in the thesis.


/* -------------========== PROGRAM INFORMATION ==========-------------
 *
 * SOURCE  : allocate.h
 * VERSION : 250
 * DATE    : 09 September 1991
 * AUTHOR  : Jonathan E. Hartman, U. S. Naval Postgraduate School
 *
 * -------------============== DESCRIPTION ==============-------------
 *
 * Declarations of functions associated with memory allocation.
 *
 * -------------=========== LIST OF FUNCTIONS ===========-------------
 *
 * cmatalloc()
 * intvecalloc()
 * matalloc()
 *
 * -------------------------------------------------------------------
 */


/* -------------========= FUNCTION DECLARATION =========-------------
 *
 * PURPOSE:     This function performs the memory allocation for a matrix
 *              structure (of the Complex_Matrix_Type) using the C function
 *              calloc().  Additionally, it fills the "rows" and "cols"
 *              fields of the matrix structure returned with the parameters
 *              passed to the function.  If a structure is returned (see
 *              "RETURNS"), then its "rows" and "cols" fields will be
 *              filled with the correct values.  The structure type is
 *              defined in "matrix.h".
 *
 * INCLUDE:     "allocate.h"
 * CALLS:       calloc()
 * CALLED BY:
 * PARAMETERS:  int rows   the number of rows in the desired matrix
 *              int cols   the number of columns in the desired matrix
 * RETURNS:     A pointer to the structure if successful; NULL otherwise.
 *              The NULL case includes non-positive rows or cols in
 *              addition to the obvious allocation failure.
 *
 * EXAMPLE:     Complex_Matrix_Type *A;
 *
 *              A = cmatalloc(7, 7);
 *
 * -------------------------------------------------------------------
 */

#ifdef PROTOTYPE

Complex_Matrix_Type *cmatalloc(int rows, int cols);

#else

Complex_Matrix_Type *cmatalloc();

#endif


/* -------------========= FUNCTION DECLARATION =========-------------
 *
 * PURPOSE:     This function performs the memory allocation for a vector,
 *              v, of num_elements integer elements.
 *
 * INCLUDE:     "allocate.h"
 * CALLS:       calloc()
 * CALLED BY:
 * PARAMETERS:  See PURPOSE.
 * RETURNS:     A pointer to the array if successful and NULL otherwise.
 *
 * EXAMPLE:     int desired_size_of_v = 7,
 *                  *v;
 *
 *              v = intvecalloc(desired_size_of_v);
 *
 * -------------------------------------------------------------------
 */

#ifdef PROTOTYPE

int *intvecalloc(int num_elements);

#else

int *intvecalloc();

#endif



/* -------------========= FUNCTION DECLARATION =========-------------
 *
 * PURPOSE:     This function performs the memory allocation for a matrix
 *              structure using the C function calloc().  Additionally, it
 *              fills the "rows" and "cols" fields of the matrix structure
 *              returned with the parameters passed in to the function.
 *              If a structure is returned (see "RETURNS"), then its "rows"
 *              and "cols" fields will be filled with the correct values.
 *              The structure type is defined in "matrix.h".
 *
 * INCLUDE:     "allocate.h"
 * CALLS:       calloc()
 * CALLED BY:
 * PARAMETERS:  int rows   the number of rows in the desired matrix
 *              int cols   the number of columns in the desired matrix
 * RETURNS:     A pointer to the structure if successful; NULL otherwise.
 *              The NULL case includes non-positive rows or cols in
 *              addition to the obvious allocation failure.
 *
 * EXAMPLE:     Double_Matrix_Type *A = matalloc(7, 7);
 *
 * -------------------------------------------------------------------
 */

#ifdef PROTOTYPE

Double_Matrix_Type *matalloc(int rows, int cols);

#else

Double_Matrix_Type *matalloc();

#endif
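
Since the source for the allocation functions is omitted, the following is a minimal sketch of how a function meeting the matalloc() description could be written. It is not the thesis code. The names `matalloc_sketch` and `matfree_sketch` are hypothetical, and the storage layout (one calloc() per row) is an assumption; the header only states that calloc() is used, that the "rows" and "cols" fields are filled in, and that NULL is returned for non-positive dimensions or on allocation failure.

```c
#include <stdlib.h>

/* Same layout as Double_Matrix_Type in matrix.h, repeated here so that
 * the sketch is self-contained. */
typedef struct {
    char   *name;
    int     rows,
            cols;
    double **matrix;
} Double_Matrix_Type;

/* Hypothetical companion to the sketch below: release everything. */
void matfree_sketch(Double_Matrix_Type *A)
{
    int i;

    if (A == NULL)
        return;
    for (i = 0; i < A->rows; i++)
        free(A->matrix[i]);
    free(A->matrix);
    free(A);
}

/* Sketch of an allocator with the documented behavior of matalloc(). */
Double_Matrix_Type *matalloc_sketch(int rows, int cols)
{
    Double_Matrix_Type *A;
    int i;

    if (rows <= 0 || cols <= 0)          /* documented NULL case */
        return NULL;

    A = (Double_Matrix_Type *) calloc(1, sizeof *A);
    if (A == NULL)
        return NULL;

    A->matrix = (double **) calloc((size_t) rows, sizeof *A->matrix);
    if (A->matrix == NULL) {
        free(A);
        return NULL;
    }

    for (i = 0; i < rows; i++) {
        A->matrix[i] = (double *) calloc((size_t) cols, sizeof **A->matrix);
        if (A->matrix[i] == NULL) {      /* undo the partial allocation */
            while (i-- > 0)
                free(A->matrix[i]);
            free(A->matrix);
            free(A);
            return NULL;
        }
    }

    A->rows = rows;                      /* fill the documented fields */
    A->cols = cols;
    return A;
}
```

Allocating row by row keeps the `A->matrix[i][j]` indexing that the `double **` member suggests; a single contiguous block with a row-pointer table would serve equally well and may be friendlier to message passing.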


clargs.h


/* -------------========== PROGRAM INFORMATION ==========-------------
 *
 * SOURCE  : clargs.h
 * VERSION : 1.5
 * DATE    : 09 September 1991
 * AUTHOR  : Jonathan E. Hartman, U. S. Naval Postgraduate School
 *
 * -------------============== DESCRIPTION ==============-------------
 *
 * This header file gives the declarations to accompany clargs.c.  These
 * files provide a standard (if somewhat limited) way of handling
 * command-line arguments.  The objective is to handle:
 *
 * 1.) Simple boolean arguments like "if -v exists, set verbose = TRUE".
 *     We will call such an argument a 'simple' argument type.  This
 *     type of argument can be recognized by the fact that it has no
 *     sub-arguments (the sub-argument count, subargc == 0).
 *
 * 2.) Arguments with sub-arguments to be interpreted as numbers.  We
 *     will call this a 'complex' argument type.  Suppose that we want to
 *     set int dim = 3 when the command line arguments contain "-d 3".
 *     This case implies several requirements:
 *
 *     a.) First, we must know in advance how many sub-arguments the
 *         argument has--we'll call this subargc (in this case we are
 *         expecting one sub-argument, so the caller would have set
 *         subargc = 1).
 *
 *     b.) Secondly, we must know how to interpret each sub-argument
 *         [i.e., what type is the sub-argument?  Is it a double or long
 *         (float and int can be handled by type casting)?]
 *
 *     We will call this kind of argument a complex argument type.  They
 *     can be recognized as those with subargc > 0.
 *
 * Here is the strategy.  The user makes a list of valid command-line
 * arguments by creating an array of pointers to structures of type
 * Arg_Struct.  We will call this the option list, (Arg_Struct *) optv[].
 * The code assumes that you can do something like this at the top of your
 * source:
 *
 *     #define MAX_NUMBER_OF_ARGS 3
 *
 *     static Arg_Struct *optv[MAX_NUMBER_OF_ARGS];
 *
 * Let (int) optc be the option count (number of options).  Every element
 * in (pointed to by) the option list is a structure of type Arg_Struct
 * defined below.  By using the standard C argc and argv, and by creating
 * and passing optc and optv around, we can manipulate command-line
 * arguments just about however we want.


* 
* the 
* 
* 


install_complex_arg() 


structure. 


ee eee ee ee oe 
-— = ee ee ee i oe 


—_— om ee ee oe oe 


interpret_args() 


* 
* 
* install_simple_arg() 
* 
* 


#ifndef 
#define 
endif 
#Hifndef 
#define 
tendif 
#ifndef 
#define 
#endif 
#ifndef 
#define 
#endif 
#tifndef 
#define 
#endif 
#ifndef 
#define 
#endif 


/* 


EXIT_FAILURE 
BALT_FPRILURE 


EXIT_SUCCESS 
EXIT_SUCCESS 


FALSE 
FALSE 


NULL 
NULL 


SUCCESS 
SUCCESS 


TRUE 
TRUE 


The next step is to understand 
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* The maximum number of characters in an argument name, MAX_ARGLEN is a 

* relatively arbitrary thing....make it whatever you want. The DOUBLE 
and LONG manifest constants are assumed to be used for values of 
subargi (see the structure below). 


i 


#define 
#define 


MAX_ARGLEN 
DOUBLE 


#define LONG 
/* -------------============ DATA STRUCTURES ============-------------
 *
 * argname    The (string) name of a valid argument.  For instance, if
 *            you want the simple argument "-v", then argname[] would be
 *            "-v".  If you have a complex argument that will appear as
 *            "-number 3 4.5 6.7", then argname will be "-number" and you
 *            must use the sub-argument variables below to handle the
 *            integer and two floating-point values.
 *
 * subargc    Consider the "-number" example again.  There are three
 *            sub-arguments (3, 4.5, and 6.7) so the sub-argument count
 *            would be 3.
 *
 * subargi[]  This array tells us how to interpret the sub-arguments.  For
 *            instance, again using the "-number" example above, we would
 *            set subargi[0] = LONG; subargi[1] = DOUBLE; and
 *            subargi[2] = DOUBLE.
 *
 * found      This field should be initialized to FALSE.  The function
 *            interpret_args() will set this field TRUE if the argname[]
 *            appears on the command-line (in *argv[]).
 *
 * dsa[]      This field is an array of double sub-arguments.
 *
 * lsa[]      This field is an array of long sub-arguments.
 *
 * Consider the "-number" example again.  After argument resolution, we
 * would find that dsa[0] is not defined since subargi[0] == LONG.
 * However, we can use subargi[] to verify that subargi[1] and subargi[2]
 * are DOUBLE.  Knowing this, we can safely presume that the values with
 * CORRESPONDING index in dsa[] should be interpreted as doubles.  That
 * is, dsa[1] will be a double value (4.5) and dsa[2] will also be a
 * double (6.7).  In a similar manner, lsa[0] must be a long (3) and
 * lsa[1] and lsa[2] are not defined.
 *
 * -------------------------------------------------------------------
 */

typedef struct {

    char argname[MAX_ARGLEN];

    int subargc,     /* how many subarguments expected */
        *subargi,    /* how to interpret subarguments */
        found;       /* set TRUE if the argument is found */

    double *dsa;     /* double-valued sub-arguments */
    long *lsa;       /* long-valued sub-argument list */

} Arg_Struct;

/* -------------========= FUNCTION DECLARATION =========-------------
 *
 * PURPOSE:     To install a valid complex argument in the option list,
 *              optv[].
 *
 * INCLUDE:     "clargs.h"
 *
 * CALLS:       strcpy()
 *
 * CALLED BY:
 *
 * PARAMETERS:  int index;
 *              Arg_Struct *optv[];
 *              const char *argname;
 *              int *interpret,
 *                  subargc;
 *
 * The first three parameters are exactly like the corresponding ones for
 * install_simple_arg().  Additionally, for complex arguments, we need to
 * pass in instructions concerning how many sub-arguments there are (i.e.,
 * subargc) and how to interpret each.  The array interpret[] should be
 * filled with subargc elements when you call this function.  The elements
 * should only be valid ones (e.g., DOUBLE, LONG).
 *
 * -------------------------------------------------------------------
 */

#ifdef PROTOTYPE

void install_complex_arg(int index, Arg_Struct *optv[],
                         const char *argname, int *interpret,
                         int subargc);

#else

void install_complex_arg();

#endif


/* -------------========= FUNCTION DECLARATION =========-------------
 *
 * PURPOSE:     To install a valid simple argument in the option list,
 *              optv[].
 *
 * INCLUDE:     "clargs.h"
 *
 * CALLS:       strcpy()
 *
 * CALLED BY:
 *
 * PARAMETERS:  int index;
 *              Arg_Struct *optv[];
 *              const char *argname;
 *
 * The 'index' gives the location of the option in the option list,
 * optv[].  The function uses this index to install the argname at the
 * proper location in optv[].  For instance, set this variable to zero for
 * the first option in the list.  Normal C indexing convention applies;
 * namely, 0 <= index < MAX_NUMBER_OF_ARGS.  The 'argname' is the string
 * that you want recognized as a valid argument.  For instance, suppose
 * that you want a timing argument to be recognized whenever "-t" appears
 * on the command line.  Then you would supply "-t" in this place.
 *
 * -------------------------------------------------------------------
 */

#ifdef PROTOTYPE

void install_simple_arg(int index, Arg_Struct *optv[],
                        const char *argname);

#else

void install_simple_arg();

#endif

/* -------------========= FUNCTION DECLARATION =========-------------
 *
 * PURPOSE:     Once the user has defined an appropriate option list,
 *              optv[], with optc options, this function parses the
 * command-line arguments (as given by argc and argv) and fills the
 * *optv[] structures appropriately.  For instance every valid (exists in
 * optv ==> valid) argument that appears on the command line will result
 * in the corresponding optv structure's 'found' field being set to TRUE.
 * The function also interprets sub-arguments and fills dsa[] and/or lsa[]
 * accordingly.  It assumes that the caller has established the desired
 * argname's, subargc's, and subargi's.
 *
 * INCLUDE:     "clargs.h"
 *
 * CALLS:       printf()
 *              strcmp()
 *              strtod()
 *              strtol()
 *
 * CALLED BY:
 *
 * PARAMETERS:  As described in PURPOSE.
 *
 * -------------------------------------------------------------------
 */

#ifdef PROTOTYPE

void interpret_args(int argc, char **argv, int optc, Arg_Struct **optv);

#else

void interpret_args();

#endif
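
The parsing strategy described for interpret_args() can be sketched as follows. This is an illustration written from the header comments alone, not the omitted clargs.c source. The function name `interpret_args_sketch` is hypothetical, and the numeric values chosen for MAX_ARGLEN, DOUBLE, and LONG are assumptions, since the listing does not preserve them. The sketch uses only the library calls named in the header (strcmp(), strtod(), strtol()) and silently skips an option whose sub-arguments are missing, whereas the original presumably reports such cases with printf().

```c
#include <stdlib.h>
#include <string.h>

#define MAX_ARGLEN 32    /* assumed value; not preserved in the listing */
#define DOUBLE     1     /* assumed value */
#define LONG       2     /* assumed value */

/* Same fields as Arg_Struct in clargs.h. */
typedef struct {
    char    argname[MAX_ARGLEN];
    int     subargc,     /* how many subarguments expected    */
           *subargi,     /* how to interpret subarguments     */
            found;       /* set TRUE if the argument is found */
    double *dsa;         /* double-valued sub-arguments       */
    long   *lsa;         /* long-valued sub-argument list     */
} Arg_Struct;

/* Hypothetical sketch: match each argv[i] against the option list and
 * convert the sub-arguments that follow it according to subargi[]. */
void interpret_args_sketch(int argc, char **argv, int optc, Arg_Struct **optv)
{
    int i, j, k;

    for (i = 1; i < argc; i++) {
        for (j = 0; j < optc; j++) {
            Arg_Struct *opt = optv[j];

            if (strcmp(argv[i], opt->argname) != 0)
                continue;
            if (i + opt->subargc >= argc)    /* too few sub-arguments */
                break;

            opt->found = 1;
            for (k = 0; k < opt->subargc; k++) {
                const char *text = argv[i + 1 + k];

                if (opt->subargi[k] == DOUBLE)
                    opt->dsa[k] = strtod(text, NULL);
                else
                    opt->lsa[k] = strtol(text, NULL, 10);
            }
            i += opt->subargc;               /* skip what was consumed */
            break;
        }
    }
}
```

For the "-number 3 4.5 6.7" example in the header, the caller would supply subargi[] = {LONG, DOUBLE, DOUBLE}; after the call, lsa[0] holds 3 and dsa[1], dsa[2] hold 4.5 and 6.7, matching the indexing convention described under DATA STRUCTURES.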
/* -------------========== PROGRAM INFORMATION ==========-------------
 *
 * SOURCE  : comm.h
 * VERSION : 2.5
 * DATE    : 14 September 1991
 * AUTHOR  : Jonathan E. Hartman, U. S. Naval Postgraduate School
 *
 * -------------============== DESCRIPTION ==============-------------
 *
 * This header file gives manifest constants and function specifications
 * for comm.c.  These files contain communication (and related) functions
 * for a normal hypercube topology and a hybrid topology.  Unfortunately
 * the code is a bit busy with #ifdef's, but the purpose of these files is
 * to make hypercubes a little more transparent.  This makes the comm.h
 * and comm.c files a bit hard to read, but you should be able to recoup
 * this loss when it comes time to write a particular application.
 *
 * -------------============== TOPOLOGIES ===============-------------
 *
 * The functions specified below have been designed to work on three very
 * different machines.  First, the Intel iPSC/2 with a normal hypercube of
 * order 0, 1, 2, or 3 is handled.  A normal hypercube of transputers is
 * next on the list (also order 0, 1, 2, or 3).  Finally, there is a
 * hybrid topology of transputers that is handled.  The normal hypercubes
 * need almost no introduction.  We have a host or root processor/program
 * together with programs running on the nodes.  I will use host and root
 * interchangeably here, although 'host' is properly associated with the
 * Intel machine and 'root' is the more correct/descriptive term when the
 * subject is transputer networks.  The hybrid topology deserves a more
 * careful introduction.
 *
 * The hybrid topology is a network of Inmos transputers (PC host with an
 * IMS B004 board and a T414 linked to sixteen T800 processors on an IMS
 * B012 board) arranged so that the 'root' is situated between nodes zero
 * and eight of a 4-cube.  This means that nodes 0 and 8 are NOT directly
 * connected.  The functions made for this topology compensate for this
 * situation.  Instead of trying to describe each function, I will simply
 * remark that the most natural way to treat this problem is (more or
 * less) as two 3-cubes attached to the root.  A more careful description
 * of how each problem is handled may be found in the code for the
 * particular function.


In summary, the transputer portions of the code depend upon: (1) a very
specific hardware configuration, (2) the appropriate NIF file to
support the usual Gray code in a convenient way
[ mynode() == _node_number - 2 ], and (3) a particular link arrangement
like the one that can be created by Mike Esposito's t2.nif, root.tld,
and switch.tld.

DETAILS: Look for additional details in hyprcube.nif.

-------------========= PREREQUISITES =========---------------

Before using any of the functions involving send() or receive(), the
host (or root) program must call initialize_hypercube(). For transputer
applications, EACH of the NODES must call initialize_hypercube() too,
and you need to be sure that a hypercube exists in hardware and that
your NIF describes a hypercube with the usual Gray code. You must define
the global variables {Channel *ic[], *oc[];} because the code depends
upon their existence. Both of these vectors must be of length
(cubesize+1) as described in the preface to initialize_hypercube().


The cubesize and dimension that you use with the transputer
implementation determine the cube. Even though you actually have
sixteen T800's in the cube, the cubesize and dimension that you use
will determine the portion that actually gets used. Note that both the
usual hypercube and the hybrid 4-cube are built upon the same hardware
and link setup. Many of the functions declared below DEPEND upon the
proper call to the initialize_hypercube() function. To avoid
difficulty, observe the guidelines given with this function!
Additionally, in the transputer case, you will need to make sure that
you include <conc.h>.




 * -------------========= LIST OF FUNCTIONS =========---------------
 *
 *   coalesce()
 *   cubecast()
 *   cubecast_from()
 *   directional_exchange()
 *   directional_receive()
 *   directional_send()
 *   hamming_distance()
 *   initialize_hypercube()
 *   least_dimension()
 *   link_number()
 *   linkin()
 *   linkout()
 *   receive()
 *   send()
 *   submit()
 *
 * ------------------------------------------------------------------
 */

/* -------------========= MACROS & MANIFEST CONSTANTS =========--------------- */

#ifdef TRANSPUTER

#define myhost() =
#define mynode() (_node_number - 2)   /* depends upon <conc.h> */

#else /* iPSC/2 */

#define ALL_NODES -1
#define ALL_PIDS ae
#define ANY_NODE 0          /* for receive(from any node, ... ) */
#define ANY_TYPE 3,         /* first non-force-type message */
#define ARBITRARY_TYPE 0    /* don't care */
#define KEEP_TIL_RELCUBE 1  /* for getcube() */
#define NODE_PID 0          /* arbitrary ... don't care */
#ifndef NULL
#define NULL 0
#endif

#endif

#ifndef FALSE
#define FALSE 0
#endif

#ifndef TRUE
#define TRUE 1
#endif

131 

132 

/* -------------========= FUNCTION DECLARATION =========---------------
 *
 * PURPOSE: This function performs the first step in the opposite of
 *          the cubecast() function. That is, this one is used when
 * you want to collect information from the nodes in ‘higher dimensions’
 * of the hypercube at the current node. You may want to perform some work
 * before forwarding this information down to the next lower dimension, so
 * the submit() function is given separately.
 *
 * Like the other functions in this file, coalesce() performs a somewhat
 * different task when executed in the hybrid 4-cube, so first we will
 * discuss the usual hypercubes. coalesce() is a null operation when
 * called from the highest dimension [ if least_dimension(node) is
 * equal to dim ]. Otherwise it performs the communication to receive
 * from higher dimensions (i.e., neighbors with larger node numbers). If
 * it is called from the host/root, it attempts to receive() from node
 * zero.
 *
 * The coalesce() and submit() functions must be balanced properly across
 * the nodes. The CALLER must take the necessary steps to be sure that
 * buf is large enough to hold ((dim - least_dimension(node)) * len)
 * bytes. That is, there will be (dim - least_dimension(node)) copies of
 * the message accumulated at the calling node.
 *
 * There are several exceptions in the hybrid 4-cube topology. Since the
 * root is connected to nodes 0000 and 1000, it must make sure that buf
 * can hold 2 copies of length len. Then you should think of nodes 0xxx
 * as one 3-cube and nodes 1xxx as another (more-or-less separate) 3-cube.
 * That is, there will be no exchanges in the 1xxx direction between them.
 * To determine the size of buf at any node, use the following formulae:
 *
 *      (3 - least_dimension(node)) * len,        Nodes 0xxx
 *
 *      (3 - least_dimension(node - 8)) * len,    Nodes 1xxx
 *
 * CAUTIONS: If you fail to allocate enough space for buf, you may find
 *           that your program doesn't work.
 *
170 * 
 *           The transputer implementation depends upon the parameter
 *           ‘type’ being set equal to cubesize.
 *
 * PREREQUISITE: initialize_hypercube()
 *
 * INCLUDE: <conc.h> (Logical Systems C, version 89.1)
 *          "comm.h"
 *
 * CALLS: least_dimension()
 *        myhost() (macro given above)
 *        pow2() "mathx.h"
 *        receive()
 *
 * CALLED BY:
 *
 * EXAMPLE: Suppose we are ‘at’ node 0 and we want to coalesce() copies
 *          of some object from all of the appropriate nodes. Let the
 * object be of size ‘len’ bytes. For concreteness, let the topology be a
 * hypercube of order 3 (i.e., dim == 3). We would allocate a large enough
 * buf to hold (dim * len) bytes, since least_dimension(0) == 0. That is,
 * node 0 will be receiving from all neighbors whose least_dimension() is
 * greater [in this case, that is ALL of its neighbors]; namely, 1, 2, and
 * 4. After the call, we would find the data from node 1 in the first len
 * bytes of buf; the data from 2 in the middle len bytes of buf; and the
 * data from 4 in the final len bytes of buf. The function is treated as
 * a multiple receive(), in increasing origin order, from the appropriate
 * neighbors.
198 * 
 * PARAMETERS:
 *
 *   int node    the coalesce()ing (receiving) node
 *   int dim     the dimension of the hypercube
 *   char *buf   a pointer to the beginning of the buffer where you want
 *               the message placed.
 *   long len    the number of bytes to be received from EACH node in
 *               the next higher dimension that will be submit()ing.
 *   long type   the type of the message (iPSC/2 applications only), or
 *               cubesize in the transputer case.
 *
 */
#ifdef PROTOTYPE

void coalesce(int node, int dim, char *buf, long len, long type);

#else

void coalesce(/* int node, int dim, char *buf, long len, long type */);

#endif
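
The buffer-size rules quoted in the coalesce() preface can be sketched as follows. This is only an illustration: ld_sketch(), coalesce_bytes(), and coalesce_bytes_hybrid() are hypothetical names that are not part of comm.h, and ld_sketch() merely mimics the documented behaviour of least_dimension().

```c
/* Hypothetical stand-in for least_dimension(): the dimension of the
 * smallest hypercube that contains the given node number. */
static int ld_sketch(int node)
{
    int d = 0;
    while (node > 0) {
        node >>= 1;
        d++;
    }
    return d;
}

/* Bytes that buf must hold before coalesce() in a normal hypercube:
 * (dim - least_dimension(node)) copies of len bytes each. */
long coalesce_bytes(int node, int dim, long len)
{
    return (long)(dim - ld_sketch(node)) * len;
}

/* The same quantity for a node (0..15) of the hybrid 4-cube, treated
 * as two 3-cubes: nodes 0xxx use node, nodes 1xxx use (node - 8). */
long coalesce_bytes_hybrid(int node, long len)
{
    int local = (node >= 8) ? node - 8 : node;
    return (long)(3 - ld_sketch(local)) * len;
}
```

A caller might allocate buf with malloc((size_t)coalesce_bytes(node, dim, len)) before calling coalesce(). The root of the hybrid 4-cube is a separate case: it needs room for 2 copies of len bytes.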


/* -------------========= FUNCTION DECLARATION =========---------------
 *
 * PURPOSE: This function is called from the root/host and all nodes to
 *          execute a broadcast to all p nodes. The host/root sends to
 * node zero to start the process off. Let lg(n) denote log_2(n). After
 * this initial transfer, the function performs the communication among
 * the nodes in lg(p) steps. For instance, node zero receives from the
 * host in what we'll call stage zero. Then, in stage 1, node 0 passes
 * the message to node 1. In stage 2, node 0 sends the message to node 2
 * and node 1 sends it to node 3. In stage 3, nodes 0, 1, 2, and 3 each
 * send the message to nodes 4, 5, 6, and 7 (respectively).
 *
 * Then, in general, in stage i, the message moves into the ith dimension.
 * If you prefer, you can think of a pointer starting (after the message
 * arrives at node 0) at the rightmost bit (LSB) and indicating the
 * direction for the next transmission. The pointer moves left until it
 * reaches the MSB. This is the final stage of the cubecast().
 *
 * The hybrid 4-cube is implemented by sending the message from the root
 * to nodes 0 and 8 first. Then node 0 performs the usual cubecast for
 * the nodes that appear in the usual 3-cube. Node 8 mirrors this action,
 * filling the other 3-cube with labels like 1xxx.
 *
 * In all cases, buf is filled with an initial receive() from the proper
 * node, and then it is used in retransmissions to other nodes. In any
 * event, buf holds the message after execution.


 *
 * CAUTION: The transputer implementation depends upon the parameter
 *          ‘type’ being set equal to cubesize.
 *
 * PREREQUISITE: initialize_hypercube()
 *
 * INCLUDE: <conc.h> (Logical Systems C, version 89.1)
 *          "comm.h"
 *
 * CALLS: least_dimension()
 *        MIN() (macro from macros.h)
 *        myhost() (macro from above)
 *        pow2() "mathx.h"
 *        receive()
 *        send()
 *
 * CALLED BY:
 *
 * PARAMETERS:
 *
 *   int node    the sending node
 *   int dim     the dimension of the hypercube
 *   char *buf   a pointer to the head of the message
 *   long len    the number of bytes to be passed
 *   long type   the type of the message (iPSC/2 applications only), or
 *               cubesize in the transputer case.
 *
 */

#ifdef PROTOTYPE

void cubecast(int node, int dim, char *buf, long len, long type);

#else

void cubecast(/* int node, int dim, char *buf, long len, long type */);

#endif
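
The stage-by-stage schedule described for cubecast() in a normal hypercube can be sketched as follows. The name cubecast_targets() is hypothetical and the routine only computes destinations; it performs no communication and ignores the hybrid 4-cube. It assumes that a node which first holds the message after stage s forwards it in stages s+1 through dim, which is how I read the description.

```c
/* Hypothetical stand-in for least_dimension(). For a non-zero node it
 * also gives the stage in which that node receives the message. */
static int ld_sketch(int node)
{
    int d = 0;
    while (node > 0) {
        node >>= 1;
        d++;
    }
    return d;
}

/* Fill targets[] with the nodes that `node` sends to during a cubecast
 * in a dim-cube, in stage order, and return how many there are.
 * targets[] must have room for at least dim entries. */
int cubecast_targets(int node, int dim, int targets[])
{
    int count = 0;
    int stage;

    for (stage = ld_sketch(node) + 1; stage <= dim; stage++) {
        /* In stage s the message moves into dimension s (bit s-1). */
        targets[count++] = node | (1 << (stage - 1));
    }
    return count;
}
```

For dim == 3 this gives node 0 the targets 1, 2, 4 and node 1 the targets 3, 5, matching the stages listed in the preface. The same list, read in reverse, names the nodes from which coalesce() expects data.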


/* -------------========= FUNCTION DECLARATION =========---------------
 *
 * PURPOSE: This function is similar to cubecast() but more general.
 *          Here we do not assume that the message starts at the host
 * or at node zero; it may start at any general source node, src. In fact,
 * it may NOT be called from the root/host (use cubecast() in that case).
 * If dim is the order of the hypercube, then src goes through dim stages,
 * passing the message to its neighbors. The sequence is defined by an
 * XOR operation that starts at bit 1 of src and moves up through bit dim.
 * For instance, suppose src == 5 == 101b in the 3-cube (dim == 3). Then
 * src will first send to (101 XOR 001) == node 4, next to (101 XOR 010)
 * == node 7, and finally to (101 XOR 100) == node 1. Meanwhile, any time
 * that a non-source node gets the message, he begins the same process,
 * but only picks it up at the appropriate stage (the one after the stage
 * in which he received the message).

 *
 * PREREQUISITE: initialize_hypercube()
 *
 * INCLUDE: <conc.h> (Logical Systems C, version 89.1)
 *          "comm.h"
 *
 * CALLS: directional_receive()
 *        directional_send()
 *        free()
 *        least_dimension()
 *        malloc()
 *        pow2() "mathx.h"
 *        receive()
 *        send()
 *        sizeof()
 *
 * CALLED BY:
 *
 * PARAMETERS:
 *
 *   int src     the source
 *   int node    the number of the node calling this function
 *   int dim     the dimension of the hypercube
 *   char *buf   a pointer to the head of the message
 *   long len    the number of bytes to be passed
 *
 * ------------------------------------------------------------------
 */

#ifdef PROTOTYPE

void cubecast_from(int src, int node, int dim, char *buf, long len);

#else

void cubecast_from();

#endif
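
The XOR sequence described for cubecast_from() can be sketched by working with node numbers relative to the source. The name cubecast_from_targets() is hypothetical, and the routine only lists destinations under my reading of the preface (a node joins in at the stage after the one in which it received the message); it is not the code from comm.c.

```c
/* Hypothetical stand-in for least_dimension(). */
static int ld_sketch(int node)
{
    int d = 0;
    while (node > 0) {
        node >>= 1;
        d++;
    }
    return d;
}

/* Fill targets[] with the nodes that `node` sends to when the message
 * originates at src in a dim-cube; return the number of targets.
 * targets[] must have room for at least dim entries. */
int cubecast_from_targets(int src, int node, int dim, int targets[])
{
    int rel = node ^ src;          /* position relative to the source    */
    int first = ld_sketch(rel) + 1; /* first stage in which node can send */
    int count = 0;
    int stage;

    for (stage = first; stage <= dim; stage++) {
        /* Stage s flips bit s-1 of the sender's node number. */
        targets[count++] = node ^ (1 << (stage - 1));
    }
    return count;
}
```

With src == 5 and dim == 3, node 5 gets the targets 4, 7, 1 (the sequence in the preface), node 4 gets 6 and 0, nodes 7 and 6 each get one target, and the remaining nodes only receive.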

/* -------------========= FUNCTION DECLARATION =========---------------
 *
 * PURPOSE: To perform an exchange along a prescribed direction. The
 *          direction is given as an integer in {1, 2, 4, 8,...,2^dim}.
 * This is because the direction is really a bit mask for the Gray-coded
 * node numbers. For instance, if you perform a directional_exchange()
 * from node == 3 == 011 in the 3-cube along direction == 4 == 100, this
 * is the same as performing a coordinated send() and receive()
 * combination with node (011 XOR 100 == 111 == 7). Care is taken to make
 * sure that deadlock does not occur.
 *
 * PREREQUISITE: initialize_hypercube()
 *
 * INCLUDE: <conc.h> (Logical Systems C, version 89.1)
 *          "comm.h"
 *
 * CALLS: pow2() "mathx.h"
 *        receive()
 *        send()
 *
 * CALLED BY:
 *
 * PARAMETERS:
 *
 *   int node        the number of the node calling this function
 *   int dim         the dimension of the hypercube
 *   int direction   as described above (1, 2, 4, 8, etc.)
 *   char *ibuf      a pointer to the head of the incoming message
 *   char *obuf      a pointer to the head of the outgoing message
 *   long len        the number of bytes to be passed
 *
 */

#ifdef PROTOTYPE

void directional_exchange(int node, int dim, int direction,
                          char *ibuf, char *obuf, long len);

#else

void directional_exchange();

#endif
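
The bit-mask view of a direction can be sketched in a few lines. Both function names are hypothetical. The ordering rule in sends_first() is an assumption of mine (one common way to keep a blocking send/receive pair from deadlocking); the preface does not say how comm.c actually orders the two operations.

```c
/* The partner of `node` along `direction` (a single-bit mask). */
int exchange_partner(int node, int direction)
{
    return node ^ direction;
}

/* ASSUMED convention, not taken from comm.c: the node whose bit in
 * this direction is 0 sends first and then receives; its partner
 * receives first and then sends. Exactly one node of each pair
 * returns 1, so the two blocking calls always match up. */
int sends_first(int node, int direction)
{
    return (node & direction) == 0;
}
```

For the example in the preface, exchange_partner(3, 4) is 7 and exchange_partner(7, 4) is 3; under the assumed convention node 3 would send first.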


/* -------------========= FUNCTION DECLARATION =========---------------
 *
 * PURPOSE: To receive from a prescribed direction. The direction is
 *          as described in directional_exchange() above.
 *
 * PREREQUISITE: initialize_hypercube()
 *
 * INCLUDE: <conc.h> (Logical Systems C, version 89.1)
 *          "comm.h"
 *
 * CALLS: pow2() "mathx.h"
 *        receive()
 *
 * CALLED BY:
 *
 * PARAMETERS:
 *
 *   int node        the number of the node calling this function
 *   int dim         the dimension of the hypercube
 *   int direction   direction to receive from
 *   char *buf       a pointer to the head of the message
 *   long len        the number of bytes to be passed
 *
 */

#ifdef PROTOTYPE

void directional_receive(int node, int dim, int direction,
                         char *buf, long len);

#else

void directional_receive();

#endif
/* -------------========= FUNCTION DECLARATION =========---------------
 *
 * PURPOSE: To send in a prescribed direction. The direction is as
 *          described in directional_exchange() above.
 *
 * PREREQUISITE: initialize_hypercube()
 *
 * INCLUDE: <conc.h> (Logical Systems C, version 89.1)
 *          "comm.h"
 *
 * CALLS: pow2() "mathx.h"
 *        send()
 *
 * CALLED BY:
 *
 * PARAMETERS:
 *
 *   int node        the number of the node calling this function
 *   int dim         the dimension of the hypercube
 *   int direction   direction to send to
 *   char *buf       a pointer to the head of the message
 *   long len        the number of bytes to be passed
 *
 */

#ifdef PROTOTYPE

void directional_send(int node, int dim, int direction,
                      char *buf, long len);

#else

void directional_send();

#endif

/* -------------========= FUNCTION DECLARATION =========---------------
 *
 * PURPOSE: To give the Hamming distance between i and j.
 *
 * INCLUDE: "comm.h"
 *
 * CALLS: sizeof()
 *
 * CALLED BY:
 *
 * PARAMETERS: int i, j   the numbers
 *
 * RETURNS: (int) the Hamming distance(i,j). That is, the number of
 *          ones in the binary exclusive OR (i XOR j).
 *
 */

#ifdef PROTOTYPE

int hamming_distance(int i, int j);

#else

int hamming_distance(/* int i, int j */);

#endif
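
One straightforward way to compute the value described under RETURNS is to XOR the two numbers and count the one bits. The sketch below uses the hypothetical name hamming_sketch() and is not the definition from comm.c.

```c
/* Count the one bits in (i XOR j). */
int hamming_sketch(int i, int j)
{
    unsigned int x = (unsigned int)i ^ (unsigned int)j;
    int count = 0;

    while (x != 0u) {
        count += (int)(x & 1u);   /* add the lowest bit      */
        x >>= 1;                  /* and move to the next one */
    }
    return count;
}
```

Two nodes of a hypercube are directly connected exactly when this distance is 1; for example, nodes 12 (1100) and 4 (0100) are neighbors.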

/* -------------========= FUNCTION DECLARATION =========---------------
 *
 * PURPOSE: The initialize_hypercube() function creates the hypercube
 *          and performs the required setup for communications. It
 * must be completed before you expect to communicate. On the iPSC/2,
 * ONLY the host code should call this function. For transputer
 * implementations every node should call it (in addition to the root
 * node). This is prerequisite to most of the other functions in this
 * file. The basic requirements for this function are so different
 * (machine dependent) that there are two versions: one for the
 * transputers and one for the iPSC/2 machine.
 *
 * INCLUDE: "comm.h"
 *
 * CALLS: attachcube() (Intel iPSC/2 C Library)
 *        calloc()
 *        free()
 *        getcube() (Intel iPSC/2 C Library)
 *        linkin()
 *        linkout()
 *        load() (Intel iPSC/2 C Library)
 *        malloc()
 *        printf()
 *        setpid() (Intel iPSC/2 C Library)
 *        sizeof()
 *        strcpy()
 *
 * CALLED BY:
 *
 * PARAMETERS: In both cases, the desired dimension of the hypercube is
 *             passed in as the first argument. After this, the functions
 *             are quite different.
 *
 * (1) iPSC/2 ---------------------------------------------------------
 *
 *   char *nodecode   A pointer to the filename of the nodecode is
 *                    required so that the function can load the node
 *                    program.
 *
550 * 
 * (2) transputers ----------------------------------------------------
 *
 *   Channel *ic[(CUBESIZE + 1)] This is the incoming channel list.
 *   You must declare it globally. Let CUBESIZE be the number of
 *   transputers in the hypercube. Then ic[] is a vector of length
 *   (CUBESIZE + 1). The indexing is such that (ic[n] == C), where
 *   n is some neighbor and C is the incoming Channel* from n. For
 *   instance, if node k finds that ic[n] == LINK1IN then node k
 *   knows to receive messages from node n via LINK1IN. The element
 *   ic[CUBESIZE] holds the channel for the root node (if any).
 *   ic[n] == NULL means that there is no connection to node n.
 *
 *   Channel *oc[(CUBESIZE + 1)] is the outgoing channel list. It
 *   is completely analogous to ic[] except that it will hold
 *   LINK0OUT, LINK1OUT, LINK2OUT, or LINK3OUT for the appropriate
 *   node index. Your only obligation is to define these lists as
 *   globals in the manner shown. The Channel pointer elements will
 *   be filled in by initialize_hypercube().
 *
 * RETURNS: The iPSC/2 version of the function returns a pointer to the
 *          name of the cube. In the transputer environment, the
 *          cubename has no meaning, so a void function suffices. For
 *          the transputer environment, the single most important task
 *          that initialize_hypercube() performs is the filling of ic[]
 *          and oc[]. These vectors are used by most of the other
 *          communications functions.
 *
 * ------------------------------------------------------------------
 */

#ifdef TRANSPUTER

void initialize_hypercube(int dim);

#else

char *initialize_hypercube(/* int dim, char *nodecode */);

#endif


/* -------------========= FUNCTION DECLARATION =========---------------
 *
 * PURPOSE: This function, called from any node in the hypercube,
 *          returns the dimension of the smallest hypercube containing
 *          that node.
 *
 * INCLUDE: "comm.h"
 *
 * CALLS: pow2() "mathx.h"
 *
 * CALLED BY:
 *
 * PARAMETERS: int node   the inquiring node
 *
 * RETURNS: For an n-cube containing P == 2^n processors, this function
 *          is designed to work for nodes numbered 0 through (P-1). If
 * the function is called from the root (host) node, there is no guarantee
 * as to the returned value. If it is called by a valid node, it will
 * return the dimension of the smallest hypercube containing that node
 * number. For instance least_dimension(0) == 0, least_dimension(1) == 1,
 * least_dimension(2) == 2, least_dimension(3) == 2, and
 * least_dimension(8) == 4.
 *
 * ------------------------------------------------------------------
 */

#ifdef PROTOTYPE

int least_dimension(int node);

#else

int least_dimension(/* int node */);

#endif
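
The values listed under RETURNS are consistent with counting the bits needed to write the node number in binary. The sketch below (hypothetical name least_dimension_sketch()) reproduces those values with shifts; the real function calls pow2() from "mathx.h" and may be organized differently.

```c
/* Dimension of the smallest hypercube containing `node`, taken here to
 * be the number of binary digits needed to write the node number
 * (0 for node 0). Intended for valid node numbers only (node >= 0). */
int least_dimension_sketch(int node)
{
    int d = 0;

    while (node > 0) {
        node >>= 1;
        d++;
    }
    return d;
}
```

Node 3 (011) needs two bits and so first appears in a 2-cube, while node 8 (1000) needs four bits and first appears in a 4-cube.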

/* -------------========= FUNCTION DECLARATIONS =========---------------
 *
 * PURPOSE: The receive() and send() functions declared below provide
 *          communication to (from) a buffer pointed to by buf. The
 * volume of material to send (receive) is indicated in bytes by the len
 * argument. The destination (origin) is given by the first argument,
 * using a valid node number. Suppose you have an n-cube established upon
 * a system with p == (2^n) node processors. Then you should refer to the
 * nodes of the hypercube by their node number, which is a Gray coded
 * value in the range [ 0, (p-1) ]. If you are at the root, of course,
 * you may not communicate with the root (at least not with these
 * functions); but if you are at one of the nodes of the hypercube, you
 * may communicate with the root by using myhost() as the origin (or
 * destination) of your message. The macro given above makes myhost()
 * available on the transputers.
 *
 * Transputers or iPSC/2? The type parameter is only used in the implied
 * sense with the iPSC/2 implementation [ it becomes type or typesel for
 * csend() or crecv() ]. For transputer implementations, type MUST BE set
 * equal to the number of nodes in the hypercube (e.g., p in the example
 * above). I have called this ‘cubesize’ in most of my references.
 *
 * PREREQUISITE: initialize_hypercube()
 *
 * INCLUDE: <conc.h> (Logical Systems C, version 89.1)
 *          "comm.h"
 *
 * CALLS: ChanIn() (Logical Systems C, version 89.1)
 *        ChanOut()
 *        crecv() (Intel iPSC/2 C Library)
 *        csend()
 *
 * CALLED BY:
 *
 * -------------========= CAUTION =========---------------
 *
 * Make sure type == cubesize in the transputer case (see the note above)!
 * ------------------------------------------------------------------
 */

#ifdef PROTOTYPE
void receive(int origin, char *buf, long len, long type);
void send(int destination, char *buf, long len, long type);
#else
void receive(/* int origin, char *buf, long len, long type */);
void send(/* int destination, char *buf, long len, long type */);
#endif


/* -------------========= FUNCTION DECLARATION =========---------------
 *
 * PURPOSE: This function is called from the nodes to submit a message
 *          to the next lower dimension. If it is called from the host
 * (root) it has no effect. When it is called from node zero, the
 * transmission is directed to the root/host. When called from any other
 * node, the information in buf is passed to the proper node in the next
 * lower dimension. The lower dimension must have an accepting coalesce()
 * or other receiving function [ coalesce() and submit() are meant to be
 * used in a balanced fashion, where each submit() or group of submit()'s
 * in one dimension is matched by a coalesce() in the next lower
 * dimension ].
 *
 * PREREQUISITE: initialize_hypercube()
 *
 * INCLUDE: <conc.h> (Logical Systems C, version 89.1)
 *          "comm.h"
 *
 * CALLS: least_dimension()
 *        pow2() "mathx.h"
 *        send()
 *
 * CALLED BY:
 *
 * EXCEPTIONS: Again, we have the hybrid hypercube in the transputer case
 *             (see many comments above). The general rule is changed in
 * this case since node 1 submit()s to the root and not node 0. This is
 * the only change.
 *
 * SPECIFICS: If you need to determine exactly where a submit() will go,
 *            you can figure it out in the following manner [ with the
 * obvious EXCEPTIONS (the previous paragraph) ].
 *
 * Suppose you are ‘at’ node i in an n-cube (p processors = 2^n). You
 * must submit() information to the (unique) node, j, that satisfies two
 * requirements:
 *
 *      (1) hamming_distance(i, j) == 1
 *
 *      (2) least_dimension(i) == (least_dimension(j) + 1)
 *
 * So, for instance, consider a 4-cube where i == 12. It should be fairly
 * easy to see that j will be node 4. This is because these two nodes are
 * adjacent and they are one dimension apart in the cube (i.e., node 4
 * first appears in a 3-cube and node 12 first appears in a 4-cube).
 *
 * PARAMETERS:
 *
 *   int node    the sending node
 *   int dim     the dimension of the hypercube
 *   char *buf   a pointer to the head of the message
 *   long len    the number of bytes to be passed
 *   long type   the type of the message (iPSC/2 applications only), or
 *               cubesize in the transputer case.
 *
 * ------------------------------------------------------------------
 */

#ifdef PROTOTYPE

void submit(int node, int dim, char *buf, long len, long type);

#else

void submit(/* int node, int dim, char *buf, long len, long type */);

#endif
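
To see where a submit() would go in a normal hypercube, one reading that agrees with both the i == 12 example and the coalesce() example (node 0 receives from nodes 1, 2, and 4) is to clear the most significant set bit of the sender's node number. That reading is my assumption, as are the names submit_destination() and HOST_SKETCH; the sketch ignores the hybrid 4-cube exception.

```c
#define HOST_SKETCH (-1)   /* hypothetical marker meaning "the host/root" */

/* Assumed rule: a node submits to the neighbor obtained by clearing
 * the highest set bit of its own number; node 0 submits to the host. */
int submit_destination(int node)
{
    int top = 1;

    if (node == 0)
        return HOST_SKETCH;

    /* Find the highest set bit of node. */
    while ((top << 1) <= node)
        top <<= 1;

    return node ^ top;
}
```

Under this rule node 12 (1100) submits to node 4 (0100), and nodes 1, 2, and 4 all submit to node 0, so the destinations are the mirror image of the cubecast() sends.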
/* -------------========== PROGRAM INFORMATION ==========-------------
 *
 * SOURCE  : complex.h
 * VERSION : 136
 * DATE    : 09 September 1991
 * AUTHOR  : Jonathan E. Hartman, U. S. Naval Postgraduate School
 *
 * -------------========== REFERENCES ==========-------------
 *
 * [1] Goldberg, David. ``What Every Computer Scientist Should Know About
 *     Floating-Point Arithmetic''. ACM Computing Surveys, Vol. 23,
 *     No. 1, March 1991.
 *
 * -------------========== DESCRIPTION ==========-------------
 *
 * This file contains the definition of Complex_Type and declarations of
 * functions that perform operations with complex numbers:
 *
 *   cadd()
 *   cdiv()
 *   cmul()
 *   csub()
 *   Im()
 *   Re()
 *
 * ------------------------------------------------------------------
 */

/* ------------========= TYPE DEFINITION =========------------ */

typedef struct {

    double x,    /* real part      */
           y;    /* imaginary part */

} Complex_Type;


/* ------------========= FUNCTION DECLARATION =========------------
 *
 * PURPOSE: To add two complex numbers, z1 and z2, and place their sum
 *          in the Complex_Type '*sum'.
 *
 * INCLUDE: "complex.h"
 *
 * PARAMETERS: The parameters give the two operands z1 and z2, and a
 *             pointer to the result, sum.
 *
 * EXAMPLE: Complex_Type z1, z2, z3;
 *
 *          cadd(z1, z2, &z3);
 *
 */

#ifdef PROTOTYPE

void cadd(Complex_Type z1, Complex_Type z2, Complex_Type *sum);

#else

void cadd();

#endif


/* ------------========= FUNCTION DECLARATION =========------------
 *
 * PURPOSE: To divide two complex numbers, (z1 / z2), and place the
 *          result in the Complex_Type '*quotient'.
 *
 * ALGORITHM: The code uses Smith's formula (page 25 of [1]) to perform
 *            the division.
 *
 * INCLUDE: "complex.h"
 *
 * PARAMETERS: The parameters give the two operands z1 and z2, and a
 *             pointer to the result, quotient.
 *
 * EXAMPLE: Complex_Type z1, z2, z3;
 *
 *          cdiv(z1, z2, &z3);
 *
 */

#ifdef PROTOTYPE

void cdiv(Complex_Type z1, Complex_Type z2, Complex_Type *quotient);

#else

void cdiv();

#endif

/* ------------========= FUNCTION DECLARATION =========------------
 *
 * PURPOSE: To multiply two complex numbers, z1 and z2, and place their
 *          product in the Complex_Type '*product'.
 *
 * INCLUDE: "complex.h"
 *
 * PARAMETERS: The parameters give the two operands z1 and z2, and a
 *             pointer to the result, product.
 *
 * EXAMPLE: Complex_Type z1, z2, z3;
 *
 *          cmul(z1, z2, &z3);
 *
 */

#ifdef PROTOTYPE

void cmul(Complex_Type z1, Complex_Type z2, Complex_Type *product);

#else

void cmul();

#endif


/* ------------========= FUNCTION DECLARATION =========------------
 *
 * PURPOSE: To place the difference of two complex numbers, (z1 - z2),
 *          into the Complex_Type '*difference'.
 *
 * INCLUDE: "complex.h"
 *
 * PARAMETERS: The parameters give the two operands z1 and z2, and a
 *             pointer to the result, difference.
 *
 * EXAMPLE: Complex_Type z1, z2, z3;
 *
 *          csub(z1, z2, &z3);
 *
 */

#ifdef PROTOTYPE

void csub(Complex_Type z1, Complex_Type z2, Complex_Type *difference);

#else

void csub();

#endif

/* ------------========= FUNCTION DECLARATION =========------------
 *
 * PURPOSE: To return the imaginary part of a complex number, z.
 *
 * PARAMETERS: The complex number, z, is passed into Im().
 *
 * RETURNS: The imaginary part of z as type double; that is, a real
 *          number y so that y * sqrt(-1) [or iy] is the imaginary part
 *          of z.
 *
 * EXAMPLE: y = Im(z);
 *
 */

#ifdef PROTOTYPE

double Im(Complex_Type z);

#else

double Im();

#endif
/* ------------========= FUNCTION DECLARATION =========------------
 *
 * PURPOSE: This function returns the real part of a complex number, z.
 *
 * PARAMETERS: The complex number, z, is passed into Re().
 *
 * RETURNS: The real part of z as type double.
 *
 * EXAMPLE: x = Re(z);
 *
 */

#ifdef PROTOTYPE

double Re(Complex_Type z);

#else

double Re();

#endif


/* -------------========== PROGRAM INFORMATION ==========-------------
 *
 * SOURCE  : complex.c
 * VERSION : 1.6
 * DATE    : 09 September 1991
 * AUTHOR  : Jonathan E. Hartman, U. S. Naval Postgraduate School
 * DETAILS : See "complex.h".
 *
 * ------------------------------------------------------------------
 */

#include <stdio.h>
#include "complex.h"


#ifdef PROTOTYPE

void cadd(Complex_Type z1, Complex_Type z2, Complex_Type *sum)

#else

void cadd(z1, z2, sum)
    Complex_Type z1,
                 z2,
                 *sum;

#endif
{
    sum->x = z1.x + z2.x;
    sum->y = z1.y + z2.y;
}
/* End cadd() --------------------------------------------------------- */


/* ------------========= FUNCTION DEFINITION =========------------ */

#ifdef PROTOTYPE
void cdiv(Complex_Type z1, Complex_Type z2, Complex_Type *quotient)
#else
void cdiv(z1, z2, quotient)
    Complex_Type z1,
                 z2,
                 *quotient;
#endif
{
    double d;

    if (fabs(z2.y) < fabs(z2.x)) {
        d = (z2.y / z2.x);
        quotient->x = ((z1.x + z1.y * d)/(z2.x + z2.y * d));
        quotient->y = ((z1.y - z1.x * d)/(z2.x + z2.y * d));
    }
    else {
        d = (z2.x / z2.y);
        quotient->x = (( z1.y + z1.x * d)/(z2.y + z2.x * d));
        quotient->y = ((-z1.x + z1.y * d)/(z2.y + z2.x * d));
    }
}
/* End cdiv() --------------------------------------------------------- */
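
The point of Smith's formula is that it never forms the sum of squares of the divisor's parts, so it stays finite in cases where the textbook formula would overflow. The self-contained sketch below restates the same two branches with a hypothetical Cplx type and smith_div() name so that it can be compiled apart from complex.h; like cdiv(), it makes no attempt to handle a zero divisor.

```c
#include <math.h>

/* Hypothetical stand-in for Complex_Type: x is the real part and
 * y is the imaginary part. */
typedef struct {
    double x, y;
} Cplx;

/* Smith's formula for q = a / b. The ratio r is always taken with the
 * larger-magnitude part of b in the denominator, so |r| <= 1. */
void smith_div(Cplx a, Cplx b, Cplx *q)
{
    double r, den;

    if (fabs(b.y) < fabs(b.x)) {
        r = b.y / b.x;
        den = b.x + b.y * r;
        q->x = (a.x + a.y * r) / den;
        q->y = (a.y - a.x * r) / den;
    } else {
        r = b.x / b.y;
        den = b.y + b.x * r;
        q->x = (a.y + a.x * r) / den;
        q->y = (a.y * r - a.x) / den;
    }
}
```

For instance, (1 + 2i) / (3 + 4i) comes out as 0.44 + 0.08i, and dividing 1e200 + 1e200i by itself gives exactly 1 + 0i even though the squared magnitude of the divisor is far too large for a double.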
/* ------------========= FUNCTION DEFINITION =========------------ */

#ifdef PROTOTYPE
void cmul(Complex_Type z1, Complex_Type z2, Complex_Type *product)
#else
void cmul(z1, z2, product)
    Complex_Type z1,
                 z2,
                 *product;
#endif
{
    product->x = (z1.x * z2.x - z1.y * z2.y);
    product->y = (z1.x * z2.y + z1.y * z2.x);
}
/* End cmul() --------------------------------------------------------- */
/* ------------========= FUNCTION DEFINITION =========------------ */

#ifdef PROTOTYPE
void csub(Complex_Type z1, Complex_Type z2, Complex_Type *difference)
#else
void csub(z1, z2, difference)
    Complex_Type z1,
                 z2,
                 *difference;
#endif
{
    difference->x = z1.x - z2.x;
    difference->y = z1.y - z2.y;
}
/* End csub() --------------------------------------------------------- */
/* ------------========= FUNCTION DEFINITION =========------------ */

#ifdef PROTOTYPE
double Im(Complex_Type z)
#else
double Im(z)
    Complex_Type z;
#endif
{
    return(z.y);       /* y holds the imaginary part (compare cmul()) */
}
/* End Im() ----------------------------------------------------------- */
/* ------------========= FUNCTION DEFINITION =========------------ */

#ifdef PROTOTYPE
double Re(Complex_Type z)
#else
double Re(z)
    Complex_Type z;
#endif
{
    return(z.x);       /* x holds the real part */
}
/* End Re() ----------------------------------------------------------- */
/* -------------========== EOF complex.c ==========------------- */
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SOURCE : epsilon.h 

VERSION : veer 

 * DATE    : 09 September 1991
 * AUTHOR  : Jonathan E. Hartman, U. S. Naval Postgraduate School
 *
 *
 * -------------========== REFERENCES ==========-------------
 *
 * [1] Gragg, William B.  Personal conversations, course notes, and MATLAB
 *     code, 1991.
 *
 *
 * -------------========== DESCRIPTION ==========-------------
 *
 * This file contains declarations of functions that determine the machine
 * precision for a particular machine.  The definition of epsilon is given
 * below.
 *
 * --------------------------------------------------------------------
 */

/* -------------========== FUNCTION DECLARATION ==========-------------
 *
 * PURPOSE:  To find the machine precision.  The machine precision, eps,
 *           is defined as the largest number which satisfies:
 *
 *                          1.0 + eps == 1.0
 *
 *           This program uses the type "double" which normally means an
 *           8-byte (64-bit) floating-point number stored in the IEEE 754
 *           double precision standard representation of
 *           [ 1 sign bit ][ 11-bit exponent ][ 52-bit mantissa/significand ].
 *
 * INCLUDE:  "epsilon.h"
 *
 * RETURNS:  The value of epsilon (double).
 *
 * --------------------------------------------------------------------
 */

double epsd();
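The definition above can be explored with a short halving loop. This is only a sketch under stated assumptions, not the thesis implementation of epsd(): the names `eps_by_halving()` and `eps_matches_float_h()` are hypothetical, and the code assumes IEEE 754 double arithmetic with round-to-nearest-even. The `volatile` store is there to discourage the compiler from keeping the sum in a wider register. The loop returns the first power of two for which `1.0 + eps == 1.0`.

```c
#include <float.h>

/* Halve eps until adding it to 1.0 no longer changes the stored result. */
double eps_by_halving(void)
{
    double eps = 1.0;
    volatile double sum = 2.0;   /* volatile: force a rounded double store */

    while (sum != 1.0) {
        eps /= 2.0;
        sum = 1.0 + eps;
    }
    return eps;
}

/* Under the IEEE assumptions above, the result is half of DBL_EPSILON. */
int eps_matches_float_h(void)
{
    return eps_by_halving() == DBL_EPSILON / 2.0;
}
```

Note that `DBL_EPSILON` in `<float.h>` uses the other common convention (the gap between 1.0 and the next representable double), which is why the two values differ by a factor of two.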


/* -------------========== FUNCTION DECLARATION ==========-------------
 *
 * PURPOSE:  This function is identical to epsd() except that it returns
 *           type float.  Note: The values returned may be identical,
 *           probably reflecting C arithmetic done in type double
 *           regardless of the ultimate type returned.  Anyway, this
 *           function does everything using type float.
 *
 * INCLUDE:  "epsilon.h"
 *
 * RETURNS:  The value of epsilon (float).
 *
 * --------------------------------------------------------------------
 */
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— ee ee ee eee SSeS 5 = —-—— PROGRAM INFORMATION sSrsssssasss------------- 
SOURCE  : generate.h 

VERSION : et 

 * DATE    : 09 September 1991
 * AUTHOR  : Jonathan E. Hartman, U. S. Naval Postgraduate School
 *
 *
 * -------------========== REFERENCES ==========-------------
 *
 * [1] Gragg, William B.  Personal conversations, course notes, and MATLAB
 *     codes, 1991.
 *
 *
 * -------------========== DESCRIPTION ==========-------------
 *
 * Declarations of matrix and vector generation/initialization functions.
 *
 *
 * -------------========== LIST OF FUNCTIONS ==========-------------
 *
 *      hilbert()
 *      identity()
 *      initial_permutation_vector()
 *      mxrand()
 *      wilkinson()
 *      zeros()
 *
 * --------------------------------------------------------------------
 */
/* -------------========== FUNCTION DECLARATION ==========-------------
 *
 * PURPOSE:    This function generates a Hilbert matrix of the specified
 *             size.  The function takes care of memory allocation, so
 *             the caller does not need to do this.  The definition used
 *             for a Hilbert matrix is (for rows and columns numbered from
 *             1) that the element at the (i,j) position has the value
 *             (1 / (i + j - 1)).
 *
 * INCLUDE:    "allocate.h"
 *             "matrix.h"
 *
 * CALLS:      matalloc()
 *
 * CALLED BY:
 *
 * PARAMETERS: The parameters tell the size of the desired matrix.
 *
 * RETURNS:    On success (i.e., no allocation problems), hilbert() returns
 *             the allocated matrix filled with the values as described.
 *             A NULL return value flags an allocation failure.
 *
 * EXAMPLE:    Double_Matrix_Type *A = hilbert(5, 7);
 *
 * --------------------------------------------------------------------
 */

#ifdef PROTOTYPE
Double_Matrix_Type *hilbert(int rows, int cols);
#else
Double_Matrix_Type *hilbert();
#endif
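The index shift between the 1-based Hilbert definition and C's 0-based arrays is easy to get wrong, so a minimal sketch may help. It is not the thesis's hilbert(): the function name `fill_hilbert()` and the flat row-major layout are assumptions made for illustration, and allocation is left to the caller rather than to matalloc().

```c
#include <assert.h>

/* Fill a caller-supplied rows-by-cols row-major array with Hilbert entries.
 * The definition is 1-based, H(i,j) = 1/(i + j - 1), so the 0-based loop
 * indices below use 1/(i + j + 1). */
void fill_hilbert(double *a, int rows, int cols)
{
    int i, j;

    assert(a != 0 && rows > 0 && cols > 0);
    for (i = 0; i < rows; i++)
        for (j = 0; j < cols; j++)
            a[i * cols + j] = 1.0 / (double)(i + j + 1);
}
```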
/* -------------========== FUNCTION DECLARATION ==========-------------
 *
 * PURPOSE:    This function generates an Identity matrix of the specified
 *             size.  The function takes care of memory allocation, so
 *             the caller does not need to do this.
 *
 * INCLUDE:    "allocate.h"
 *             "matrix.h"
 *
 * CALLS:      matalloc()
 *
 * CALLED BY:
 *
 * PARAMETERS: The parameters tell the size of the matrix.
 *
 * RETURNS:    On success (i.e., no allocation problems), identity()
 *             returns the allocated matrix with ones on the diagonal.
 *             A NULL return value flags an allocation failure.
 *
 * EXAMPLE:    Double_Matrix_Type *A = identity(5, 7);
 *
 * --------------------------------------------------------------------
 */

#ifdef PROTOTYPE
Double_Matrix_Type *identity(int rows, int cols);
#else
Double_Matrix_Type *identity();
#endif
/* -------------========== FUNCTION DECLARATION ==========-------------
 *
 * PURPOSE:    To initialize a permutation vector, p[].  This function
 *             performs allocation for p[], assuming that it must contain
 *             n integer elements.  Additionally, the function assigns
 *             values p[j] = j for all 0 <= j < n.  If allocation fails, p
 *             will be NULL upon return.
 *
 * INCLUDE:    "allocate.h"
 *
 * CALLS:      intvecalloc()
 *
 * CALLED BY:
 *
 * PARAMETERS: The size of the vector, n.
 *
 * RETURNS:    (A pointer to) The vector.
 *
 * --------------------------------------------------------------------
 */

#ifdef PROTOTYPE
int *initial_permutation_vector(int n);
#else
int *initial_permutation_vector();
#endif


/* -------------========== FUNCTION DECLARATION ==========-------------
 *
 * PURPOSE:    This function generates a matrix whose elements are
 *             pseudo-random numbers (generated by lcdrand() in mathx.c).
 *
 * INCLUDE:    "allocate.h"
 *             "mathx.h"
 *             "matrix.h"
 *
 * CALLS:      lcdrand()
 *             matalloc()
 *
 * CALLED BY:
 *
 * PARAMETERS: The parameters tell the size of the matrix.
 *
 * RETURNS:    On success (i.e., no allocation problems), mxrand() returns
 *             the allocated matrix filled with the random values.  A NULL
 *             return value flags an allocation failure.
 *
 * EXAMPLE:    Double_Matrix_Type *A = mxrand(5, 7);
 *
 * --------------------------------------------------------------------
 */

#ifdef PROTOTYPE
Double_Matrix_Type *mxrand(int rows, int cols);
#else
Double_Matrix_Type *mxrand();
#endif


/* -------------========== FUNCTION DECLARATION ==========-------------
 *
 * PURPOSE:    This function generates a Wilkinson matrix of the specified
 *             size.  The function takes care of memory allocation, so
 *             the caller does not need to do this.  The definition used
 *             for a Wilkinson matrix is: ones along the diagonal, ones
 *             along the rightmost column, zeros in the upper right
 *             triangle, and (-1)'s in the lower left triangle.
 *
 *                          [ 1  0  0  1 ]
 *                          [-1  1  0  1 ]
 *                          [-1 -1  1  1 ]
 *                          [-1 -1 -1  1 ]
 *
 * INCLUDE:    "allocate.h"
 *             "matrix.h"
 *
 * CALLS:      matalloc()
 *
 * CALLED BY:
 *
 * PARAMETERS: The parameters tell the size of the matrix.
 *
 * RETURNS:    On success (i.e., no allocation problems), wilkinson()
 *             returns the allocated matrix filled with the values as
 *             described.  On (allocation) failure, wilkinson() returns
 *             NULL.
 *
 * EXAMPLE:    Double_Matrix_Type *A = wilkinson(5, 7);
 *
 * --------------------------------------------------------------------
 */

#ifdef PROTOTYPE
Double_Matrix_Type *wilkinson(int rows, int cols);
#else
Double_Matrix_Type *wilkinson();
#endif
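The four-part description of the Wilkinson matrix translates into a small case analysis per element. The sketch below is an illustration only: `fill_wilkinson()` is a hypothetical name, the matrix is stored as a flat row-major array supplied by the caller, and the choice to let the rightmost-column rule take precedence for non-square shapes is an assumption.

```c
#include <assert.h>

/* Fill a rows-by-cols row-major array with the Wilkinson pattern:
 * ones on the diagonal and in the rightmost column, -1 below the
 * diagonal, 0 above it. */
void fill_wilkinson(double *a, int rows, int cols)
{
    int i, j;

    assert(a != 0 && rows > 0 && cols > 0);
    for (i = 0; i < rows; i++) {
        for (j = 0; j < cols; j++) {
            double v;
            if (j == cols - 1 || i == j)
                v = 1.0;                /* rightmost column or diagonal */
            else if (i > j)
                v = -1.0;               /* lower left triangle */
            else
                v = 0.0;                /* upper right triangle */
            a[i * cols + j] = v;
        }
    }
}
```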
/* -------------========== FUNCTION DECLARATION ==========-------------
 *
 * PURPOSE:    This function generates a matrix of the specified size,
 *             where all of the entries are zero.
 *
 * INCLUDE:    "allocate.h"
 *             "matrix.h"
 *
 * CALLS:      matalloc()
 *
 * CALLED BY:
 *
 * PARAMETERS: The parameters tell the size of the matrix.
 *
 * RETURNS:    On success (i.e., no allocation problems), zeros() returns
 *             the allocated matrix filled with zeros.  On allocation
 *             failure, zeros() returns NULL.
 *
 * EXAMPLE:    Double_Matrix_Type *A = zeros(5, 7);
 *
 * --------------------------------------------------------------------
 */

#ifdef PROTOTYPE
Double_Matrix_Type *zeros(int rows, int cols);
#else
Double_Matrix_Type *zeros();
#endif
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[#8 ------- SS SSSSsS=e PROGRAM INFORMATION SSssssssss------------- 
ae 
= SOURCE : io.h 
* VERSION : Pan Pe 
 * DATE    : 09 September 1991
 * AUTHOR  : Jonathan E. Hartman, U. S. Naval Postgraduate School
 *
 *
 * -------------========== DESCRIPTION ==========-------------
 *
 * This file contains declarations of functions for matrix and vector
 * input/output.  The matrix structures such as "Double_Matrix_Type" are
 * given in "matrix.h".
 *
 * The following parameters are common enough to justify a one-time
 * explanation here (and not with each occurrence below):
 *
 *      width   the width in which to print a value
 *
 *      aft     the number of places to print after the decimal point
 *
 *
 * -------------========== LIST OF FUNCTIONS ==========-------------
 *
 *      answer()
 *      fill_matrix()
 *      fread_matrix()
 *      fwrite_matrix()
 *      getint()
 *      get_matrix_size()
 *      pause()
 *      printmd()
 *      printvd()
 *      printvi()
 *
 * --------------------------------------------------------------------
 */
/* -------------========== MANIFEST CONSTANTS ==========------------- */
#define LONG_AFT 8 
#define LONG_WIDTH 12 
#define SHORT_AFT 


#define SHORT_WIDTH 
#define STD_AFT 
#define STD_WIDTH 


ow om dN 


tO 
= 
WwW 


/* -------------========== FUNCTION DECLARATION ==========-------------
 *
 * PURPOSE:    To get a yes or no answer from the user.
 *
 * NOTE:       This function includes the prompt "(y/n)?  " so you do not
 *             have to include this in your query.  There is no space
 *             before, two spaces after, and no newline (i.e., as shown).
 *
 * INCLUDE:    <stdio.h>
 *             "io.h"
 *
 * CALLS:      getchar()   <stdio.h>
 *
 * CALLED BY:
 *
 * PARAMETERS: void.
 *
 * RETURNS:    (int) YES or NO (as defined in matrix.h).
 *
 * --------------------------------------------------------------------
 */

int answer();


/* -------------========== FUNCTION DECLARATION ==========-------------
 *
 * PURPOSE:    A function which prompts the user for the pertinent data
 *             about a matrix and fills the structure provided with the
 *             appropriate information.  That is, this function allows the
 *             user to input the values of the elements.
 *
 * PARAMETERS: A pointer to the structure containing the matrix to be
 *             filled.
 *
 * INCLUDE:    <stdio.h>
 *             "io.h"
 *
 * CAUTION:    This function ASSUMES that the "rows" and "cols" fields
 *             have been correctly assigned by something like matalloc()
 *             [see "allocate.h"] and makes no effort to enter a value in
 *             those fields of the matrix structure.
 *
 * CALLS:
 *
 * CALLED BY:
 *
 * PARAMETERS: The parameters tell the size of the matrix.
 *
 * RETURNS:    The matrix associated with A is operated on during the
 *             execution of the function, and the result is available
 *             upon return.
 *
 * EXAMPLE:    if (!fill_matrix(&A)) ....
 *
 * --------------------------------------------------------------------
 */

#ifdef PROTOTYPE
void fill_matrix(Double_Matrix_Type *A);
#else
void fill_matrix();
#endif


/* -------------========== FUNCTION DECLARATION ==========-------------
 *
 * PURPOSE:    A function which reads data from a file and stores it in
 *             the matrix of A.  This function takes care of matrix
 *             allocation for the caller.
 *
 * INCLUDE:    <stdio.h>
 *             "io.h"
 *
 * CAUTION:    This function ASSUMES the file has been stored in the
 *             format described in "matrix.fmt".
 *
 * CALLS:      fgets()
 *             fscanf()
 *             rewind()
 *
 * CALLED BY:
 *
 * PARAMETERS: The pointer to the matrix structure and the file pointer.
 *
 * RETURNS:    1 on success and 0 on any sort of failure.
 *
 * --------------------------------------------------------------------
 */

#ifdef PROTOTYPE
int fread_matrix(Double_Matrix_Type **A, FILE *fp);
#else
int fread_matrix();
#endif


/* -------------========== FUNCTION DECLARATION ==========-------------
 *
 * PURPOSE:    A function which writes data from A->matrix[][] to a file
 *             pointed to by fp.
 *
 * INCLUDE:    <stdio.h>
 *             "io.h"
 *
 * ASSUMPTION: The caller has already performed fopen() on fp for the
 *             "w" (write) mode.
 *
 * CALLS:      fprintf()
 *             rewind()
 *
 * CALLED BY:
 *
 * PARAMETERS: A is a pointer to the structure which contains the matrix.
 *             fp is a FILE pointer.
 *
 * RETURNS:    1 on success and 0 on failure.
 *
 * --------------------------------------------------------------------
 */

#ifdef PROTOTYPE
int fwrite_matrix(Double_Matrix_Type *A, FILE *fp, int width, int aft);
#else
int fwrite_matrix();
#endif


/* -------------========== FUNCTION DECLARATION ==========-------------
 *
 * PURPOSE:    A function to get user input of a single integer.
 *
 * INCLUDE:    <stdio.h>
 *             "io.h"
 *
 * CALLS:      fflush()
 *             scanf()
 *
 * CALLED BY:
 *
 * RETURNS:    The user's integer.
 *
 * --------------------------------------------------------------------
 */

int getint();

/* -------------========== FUNCTION DECLARATION ==========-------------
 *
 * PURPOSE:    A function to ask the user for the size of a matrix.
 *
 * INCLUDE:    <stdio.h>
 *             "io.h"
 *
 * CALLS:      answer()
 *             fflush()
 *             scanf()
 *
 * CALLED BY:
 *
 * PARAMETERS: Pointers to the size of the matrix (m rows by n columns).
 *
 * --------------------------------------------------------------------
 */

#ifdef PROTOTYPE
void get_matrix_size(int *m, int *n);
#else
void get_matrix_size();
#endif

/* -------------========== FUNCTION DECLARATION ==========-------------
 *
 * PURPOSE:    Press a key to continue!
 *
 * INCLUDE:    <stdio.h>
 *             "io.h"
 *
 * CALLS:      fflush()
 *             getchar()
 *             printf()
 *
 * --------------------------------------------------------------------
 */

void pause();

/* -------------========== FUNCTION DECLARATION ==========-------------
 *
 * PURPOSE:    This function provides a printout of the information stored
 *             in the structure A.
 *
 * INCLUDE:    <stdio.h>
 *             "io.h"
 *
 * CALLS:      printf()
 *
 * PARAMETERS: A is the structure that contains the matrix to be printed.
 *             The width and aft values are described near the top of this
 *             file.  The defaults are defined as manifest constants.
 *
 * EXAMPLE:    Double_Matrix_Type *A = hilbert(7, 5);
 *
 *             printmd(*A, LONG_WIDTH, LONG_AFT);
 *
 * --------------------------------------------------------------------
 */

#ifdef PROTOTYPE
void printmd(Double_Matrix_Type A, int width, int aft);
#else
void printmd();
#endif

/* -------------========== FUNCTION DECLARATION ==========-------------
 *
 * PURPOSE:    This function prints the vector, v, of doubles.
 *
 * INCLUDE:    <stdio.h>
 *             "io.h"
 *
 * CALLS:      printf()
 *
 * CALLED BY:
 *
 * PARAMETERS: v is the vector.  size is the number of elements in v[].
 *
 * --------------------------------------------------------------------
 */

#ifdef PROTOTYPE
void printvd(double *v, int size, int width, int aft);
#else
void printvd();
#endif
/* -------------========== FUNCTION DECLARATION ==========-------------
 *
 * PURPOSE:    This function provides a printout of the integer vector v.
 *
 * INCLUDE:    <stdio.h>
 *             "io.h"
 *
 * CALLS:      printf()
 *
 * CALLED BY:
 *
 * PARAMETERS: v is a vector of size integers.
 *
 * --------------------------------------------------------------------
 */

#ifdef PROTOTYPE
void printvi(int *v, int size, int width);
#else
void printvi();
#endif
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[*% err enn nn Sessssssse PROGRAM INFORMATION SSSSS SESS some 
* 
* SOURCE =: mathx.h 
* VERSION : d2 
 * DATE    : 09 September 1991
 * AUTHOR  : Jonathan E. Hartman, U. S. Naval Postgraduate School
 *
 *
 * -------------========== REFERENCES ==========-------------
 *
 * [1] Knuth, Donald E.  The Art of Computer Programming, Volume 2:
 *     Seminumerical Algorithms.  Addison-Wesley Publishing Company,
 *     Reading, MA, 1969, pp. 9-24.
 * [2] Sedgewick, Robert.  Algorithms, Second Edition.  Addison-Wesley
 *     Publishing Company, Reading, MA, 1988, pp. 513-514.
 *
 *
 * -------------========== DESCRIPTION ==========-------------
 *
 * A small extension to the usual C <math.h>.
 *
 *
 * -------------========== LIST OF FUNCTIONS ==========-------------
 *
 *      lcdrand()
 *      lclrand()
 *      multmod()
 *      pow2()
 *
 * --------------------------------------------------------------------
 */
/* -------------========== MANIFEST CONSTANTS ==========------------- */
#ifndef EXIT_FAILURE 
#define EXIT_FAILURE = 
#endif 
#define START 1234567 /* starting value, Xo. See [1] */ 
#define MULT 31415821 /* multiplier, a. See [1] */ 
#define INCR 1 /* increment, c. See [1] */ 
#define SQRTM 10000 /* sqrt(m) */ 
#define MODULUS 100000000 /* modulus, m. See [1] */ 


/* -------------========== FUNCTION DECLARATION ==========-------------
 *
 * PURPOSE:    To calculate a pseudo-random number in the range [0, 1]
 *             using the linear congruential method.  This function is a
 *             very simple application of lclrand().  It merely divides
 *             the value that lclrand() returns by the modulus, and
 *             returns the resulting double value.
 *
 * INCLUDE:    "mathx.h"
 *
 * CALLS:      lclrand()
 *
 * CALLED BY:  mxrand()    "generate.c"
 *
 * PARAMETERS: The parameters are identical to those for lclrand().
 *
 * RETURNS:    A pseudo-random double value in the range [ 0.0, 1.0 ].
 *
 * EXAMPLE:    double d;
 *
 *             d = lcdrand(START, MULT, INCR, SQRTM, MODULUS);
 *
 * --------------------------------------------------------------------
 */

#ifdef PROTOTYPE
double lcdrand(long Xn, long a, long c, long sqrtm, long m);
#else /* iPSC/2 */
double lcdrand(/* long Xn, long a, long c, long sqrtm, long m */);
#endif


/* -------------========== FUNCTION DECLARATION ==========-------------
 *
 * PURPOSE:    To calculate a pseudo-random number of type long in the
 *             range [0, (m-1)], where m is the argument for modulus.  The
 *             algorithm uses the linear congruential method.  This method
 *             is given in great detail in [1].  A shorter, algorithmic
 *             treatment is given in [2].  I have tested the function to
 *             be sure that it produces the ten numbers listed on page 513
 *             of [2].
 *
 * INCLUDE:    "mathx.h"
 *
 * CALLS:      multmod()
 *
 * CALLED BY:  lcdrand()
 *
 * PARAMETERS: The notation comes from [1] (more-or-less).  Xn is the
 *             starting value.  a is the multiplier.  c is the increment.
 *             sqrtm is the square root of m, which is the modulus.  A
 *             negative value for any of the arguments is impossible and
 *             will invoke the defaults given among the manifest constants
 *             above.  The starting value, Xn, is the exception.  If you
 *             supply a nonnegative value, your value will be accepted as
 *             the starting value.  Else, the starting value BEGINS at the
 *             default START and is changed each time the function is
 *             called (as long as the starting value argument, Xn, is
 *             negative).  That is, Xn HAS MEMORY as long as your program
 *             is running.  The other parameters are determined from
 *             call-to-call.
 *
 * RETURNS:    A pseudo-random long in the range [ 0, (m-1) ], where m is
 *             the modulus argument.
 *
 * EXAMPLE:    This example illustrates the use of the default values:
 *
 *             long l;
 *
 *             l = lclrand(START, MULT, INCR, SQRTM, MODULUS);
 *
 * --------------------------------------------------------------------
 */

#ifdef PROTOTYPE
long lclrand(long Xn, long a, long c, long sqrtm, long m);
#else /* iPSC/2 */
long lclrand(/* long Xn, long a, long c, long sqrtm, long m */);
#endif


/* -------------========== FUNCTION DECLARATION ==========-------------
 *
 * PURPOSE:    To calculate (a * b) mod m^2, while trying to avoid
 *             overflow.  This function is adapted from Sedgewick's 'mult'
 *             function on page 513 of [2].
 *
 * INCLUDE:    "mathx.h"
 *
 * CALLS:
 *
 * CALLED BY:  lclrand()
 *
 * PARAMETERS: long a, b, m.
 *
 * RETURNS:    long (a * b) mod m^2.
 *
 * --------------------------------------------------------------------
 */

#ifdef PROTOTYPE
long multmod(long a, long b, long m);
#else
long multmod(/* long a, long b, long m */);
#endif
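The way the split multiplication keeps every intermediate product small is easier to see in code. The following is a sketch in the spirit of Sedgewick's approach, not the thesis's multmod() or lclrand(): the names `mulmod_split()`, `lcg_next()`, `lcg_next_unit()` and the `LCG_*` constants are hypothetical, although the constant values are the ones listed among the manifest constants (multiplier 31415821, increment 1, modulus 100000000 with square root 10000). It assumes nonnegative operands smaller than the modulus, so that no intermediate value exceeds a 32-bit long.

```c
#include <assert.h>

#define LCG_SQRTM   10000L
#define LCG_MODULUS 100000000L   /* LCG_SQRTM squared */
#define LCG_MULT    31415821L
#define LCG_INCR    1L

/* (a * b) mod sqrtm^2, computed from the high and low halves of a and b
 * so that no single product exceeds sqrtm^2. */
long mulmod_split(long a, long b, long sqrtm)
{
    long m  = sqrtm * sqrtm;
    long a1 = a / sqrtm, a0 = a % sqrtm;
    long b1 = b / sqrtm, b0 = b % sqrtm;

    assert(a >= 0 && b >= 0 && sqrtm > 0);
    /* a1*b1*m vanishes mod m, so only the cross terms and a0*b0 remain */
    return (((a0 * b1 + a1 * b0) % sqrtm) * sqrtm + a0 * b0) % m;
}

/* One step of the linear congruential recurrence x <- (a*x + c) mod m. */
long lcg_next(long x)
{
    return (mulmod_split(x, LCG_MULT, LCG_SQRTM) + LCG_INCR) % LCG_MODULUS;
}

/* Advance the state and scale the result into [0.0, 1.0). */
double lcg_next_unit(long *state)
{
    *state = lcg_next(*state);
    return (double)(*state) / (double)LCG_MODULUS;
}
```

Starting from 1234567, the first two values produced are 35884508 and 80001069. Because the state never reaches the modulus, this scaling yields values strictly below 1.0.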


/* -------------========== FUNCTION DECLARATION ==========-------------
 *
 * PURPOSE:    To calculate the value of two raised to the (n) power.  This
 *             function [unlike the macro POW2() given in macros.h] will
 *             handle the case where (n == 0).  This function uses left
 *             shifts to achieve the result, so if you ask for too large a
 *             value, the result is not guaranteed.  The value of n is
 *             ASSUMED to be a POSITIVE integer.
 *
 * INCLUDE:    "mathx.h"
 *
 * CALLS:
 *
 * CALLED BY:
 *
 * PARAMETERS: The desired power of two, n.
 *
 * RETURNS:    The function returns the value of 2^(n).
 *
 * --------------------------------------------------------------------
 */

#ifdef PROTOTYPE
long pow2(int n);
#else
long pow2(/* int n */);
#endif


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SOURCE : num_sys.h 

VERSION : Lead 

 * DATE    : 09 September 1991
 * AUTHOR  : Jonathan E. Hartman, U. S. Naval Postgraduate School
 *
 *
 * -------------========== REFERENCES ==========-------------
 *
 * [1] Goldberg, David.  "What Every Computer Scientist Should Know About
 *     Floating-Point Arithmetic."  ACM Computing Surveys, Vol. 23,
 *     No. 1, March, 1991, pp. 6-48.
 *
 * [2] Hayes, John P.  "Computer Architecture and Organization."  McGraw-
 *     Hill Book Company, New York, Second Edition, 1988, p. 196.
 *
 *
 * -------------========== DESCRIPTION ==========-------------
 *
 * The "num_sys" group of functions relate to number systems (e.g., binary,
 * decimal, hexadecimal).
 *
 *
 * -------------========== LIST OF FUNCTIONS ==========-------------
 *
 *      binrep()
 *      binvec()
 *      hexrep()
 *      ieeerep()
 *
 * --------------------------------------------------------------------
 */

/* -------------========== FUNCTION DECLARATION ==========-------------
 *
 * PURPOSE:    To display the binary representation of a number.  Given the
 *             parameters described below, binrep() prints the binary
 *             representation.  For numbers of type double, type float, or
 *             type int, binrep() reverses the order of the bytes from the
 *             machine storage.  This makes them more readily recognizable
 *             as [ SIGN ][ EXPONENT ][ MANTISSA ] for the floating-point
 *             types and orders the bytes in order of decreasing
 *             significance for the integers.
 *
 * INCLUDE:    "num_sys.h"
 *
 * CALLS:
 *
 * CALLED BY:
 *
 * PARAMETERS: The function needs to know what type of number you are
 *             sending in, so use the types given in matrix.h.  The
 *             function understands TYPE_CHAR, TYPE_DOUBLE, TYPE_FLOAT,
 *             and TYPE_INT.  It also needs a pointer to the_number.
 *
 * EXAMPLE:    float f;
 *
 *             binrep(TYPE_FLOAT, &f);
 *
 * --------------------------------------------------------------------
 */

#ifdef PROTOTYPE
void binrep(int number_type, void *the_number);
#else
void binrep();
#endif


/* -------------========== FUNCTION DECLARATION ==========-------------
 *
 * PURPOSE:    To expand the bits of the input into an array of integers.
 *             The array only holds zeros and ones, with each element
 *             representing a bit of the input number.
 *
 * INCLUDE:    "num_sys.h"
 *
 * CALLS:
 *
 * CALLED BY:
 *
 * CAUTION:    This function returns the bits AS THEY ARE IN THE MACHINE!
 *             Many machines store type double, type float, and type int
 *             so that their bytes are in an order that is the reverse of
 *             what you might expect.  Of course, the bits within a byte
 *             are in the expected (msb...lsb) order.
 *
 * PARAMETERS: The function needs to know what type of number you are
 *             sending in, so use the types given in matrix.h.  The
 *             function recognizes TYPE_CHAR, TYPE_DOUBLE, TYPE_FLOAT, and
 *             TYPE_INT.  It also asks for a pointer to the number.
 *
 * RETURNS:    A pointer to int.  The function will take care of allocation
 *             for this pointer, and it will fill the array with the bits
 *             of the number.  For indexing purposes, you will probably
 *             need to know how big this vector is.  Multiply the
 *             [sizeof(type you are sending in)] by 8 (bits/byte).  That's
 *             how many elements will be in the returned vector of integer
 *             (bits).  This pointer will be NULL if there was an
 *             allocation problem.
 *
 * EXAMPLE:
 *
 *             float f;      Assume that this takes 4 bytes * 8 bits
 *
 *             int  *v;      To hold the bit vector of f (32 elements)
 *
 *             v = binvec(TYPE_FLOAT, &f);
 *
 * --------------------------------------------------------------------
 */

#ifdef PROTOTYPE
int *binvec(int number_type, void *the_number);
#else
int *binvec();
#endif
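The caution about byte order versus bit order within a byte can be made concrete with a small sketch. This is not the thesis's binvec(): the function `bit_vector()` is a hypothetical name, it takes a byte count instead of a TYPE_ code, and it assumes 8-bit bytes. Bytes are visited in machine (memory) order, and within each byte the most significant bit comes first.

```c
#include <assert.h>
#include <stdlib.h>

/* Expand the object at obj (nbytes long) into a freshly allocated array of
 * nbytes * 8 ints, each 0 or 1.  Returns NULL on allocation failure; the
 * caller frees the result. */
int *bit_vector(const void *obj, size_t nbytes)
{
    const unsigned char *p = (const unsigned char *)obj;
    int *bits;
    size_t i;
    int b;

    assert(obj != NULL && nbytes > 0);
    bits = (int *)malloc(nbytes * 8 * sizeof(int));
    if (bits == NULL)
        return NULL;
    for (i = 0; i < nbytes; i++)              /* bytes in memory order */
        for (b = 0; b < 8; b++)               /* msb first within a byte */
            bits[i * 8 + b] = (p[i] >> (7 - b)) & 1;
    return bits;
}
```

For a single byte holding 0xA5 the result is 1 0 1 0 0 1 0 1. For multi-byte types the order of the 8-bit groups depends on the machine's byte order, which is exactly the point of the caution above.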


/* -------------========== FUNCTION DECLARATION ==========-------------
 *
 * PURPOSE:    To display the hexadecimal representation of a number.
 *
 * INCLUDE:    "num_sys.h"
 *
 * CALLS:
 *
 * CALLED BY:
 *
 * PARAMETERS: The function needs to know what type of number you are
 *             sending in, so use the types given in matrix.h.  The
 *             function recognizes TYPE_CHAR, TYPE_DOUBLE, TYPE_FLOAT, and
 *             TYPE_INT.  It also needs a pointer to the number.
 *
 * EXAMPLE:    float f;
 *
 *             printf("The hexadecimal representation of %f is: ", f);
 *             hexrep(TYPE_FLOAT, &f);
 *
 * --------------------------------------------------------------------
 */

#ifdef PROTOTYPE
void hexrep(int number_type, void *the_number);
#else
void hexrep();
#endif


/* -------------========== FUNCTION DECLARATION ==========-------------
 *
 * PURPOSE:    To display binary and IEEE representation of a number.  This
 *             is nearly a tutorial function!  It displays a binary
 *             representation of the number, and then breaks out the sign,
 *             exponent, and mantissa (or significand).  Some terse
 *             translation tips are also provided.
 *
 * INCLUDE:    "num_sys.h"
 *
 * CALLS:
 *
 * CALLED BY:
 *
 * PARAMETERS: The function needs to know what type of number you are
 *             sending in, so use the types given in matrix.h.  This
 *             function ONLY recognizes the floating-point types (i.e.,
 *             TYPE_DOUBLE and TYPE_FLOAT).  It also needs a pointer to
 *             the number.
 *
 * EXAMPLE:    float f;
 *
 *             printf("The IEEE 754 representation of %f is: ", f);
 *             ieeerep(TYPE_FLOAT, &f);
 *
 * --------------------------------------------------------------------
 */

#ifdef PROTOTYPE
void ieeerep(int number_type, void *the_number);
#else
void ieeerep();
#endif
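The sign, exponent, and mantissa breakdown that ieeerep() displays can be sketched with a few shifts and masks. The code below is an illustration rather than the thesis routine: `Ieee_Fields` and `ieee_split()` are hypothetical names, the fixed-width types come from C99's `<stdint.h>` (newer than the compilers used in the thesis), and the code assumes that `double` is the 64-bit IEEE 754 format.

```c
#include <stdint.h>
#include <string.h>

/* The three fields of an IEEE 754 double. */
typedef struct {
    unsigned sign;       /* 1 bit                       */
    unsigned exponent;   /* 11 bits, biased by 1023     */
    uint64_t fraction;   /* 52 bits, implicit leading 1 */
} Ieee_Fields;

Ieee_Fields ieee_split(double d)
{
    uint64_t bits;
    Ieee_Fields f;

    /* memcpy copies the object representation without aliasing problems */
    memcpy(&bits, &d, sizeof bits);
    f.sign     = (unsigned)(bits >> 63);
    f.exponent = (unsigned)((bits >> 52) & 0x7FFu);
    f.fraction = bits & 0xFFFFFFFFFFFFFULL;      /* low 52 bits */
    return f;
}
```

For example, 1.0 splits into sign 0, biased exponent 1023, and fraction 0, while -2.5 (that is, -1.25 * 2^1) gives sign 1, biased exponent 1024, and a fraction with only bit 50 set.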


/* -------------========== PROGRAM INFORMATION ==========-------------
 *
 * SOURCE  : ops.h

4 * VERSION : deal 

 * DATE    : 09 September 1991
 * AUTHOR  : Jonathan E. Hartman, U. S. Naval Postgraduate School
 *
 *
 * -------------========== REFERENCES ==========-------------
 *
 * [1] Golub, Gene H., and Charles F. Van Loan.  Matrix Computations.  The
 *     Johns Hopkins University Press, Baltimore, 1989.
 *
 *
 * -------------========== DESCRIPTION ==========-------------
 *
 * The functions declared below perform matrix and vector operations.  For
 * the sake of brevity, I will often use simple (MATLAB-style) notation in
 * comments.  For instance, x' means x transpose (i.e., a row).  Do not
 * confuse the comment shorthand with what is really happening in the
 * code.  My goal is to get function specifications across clearly and
 * succinctly without excessive concern for implementation.  Here are a
 * few notes.
 *
 * An operation preceded by a "." means "elementwise".  For instance,
 * x .* y means the elementwise vector multiplication of x by y.  That is,
 * the result would be some vector z like:
 *
 *      z = [ x(1)*y(1), x(2)*y(2), ..., x(n)*y(n) ]
 *
 * If the operation appears without the preceding ".", it means the vector
 * operation.
 *
 *
 * -------------========== LIST OF FUNCTIONS ==========-------------
 *
 *      cols()
 *      dot_product()
 *      matrix_product()
 *      max_element()
 *      normp()
 *      outer_product()
 *      rows()
 *      swap_cols()
 *      swap_rows()
 *      vec_init()
 *
 * --------------------------------------------------------------------
 */

/* -------------========== FUNCTION DECLARATION ==========-------------
 *
 * PURPOSE:    To return the number of columns in the matrix A.
 *
 * INCLUDE:    "ops.h"
 *
 * --------------------------------------------------------------------
 */

#ifdef PROTOTYPE
int cols(Double_Matrix_Type *A);
#else
int cols(/* Double_Matrix_Type *A */);
#endif


/* -------------========== FUNCTION DECLARATION ==========-------------
 *
 * PURPOSE:    Computes the dot product of the input vectors x and y which
 *             is defined in [1] (page 4).  The dot product of x and y is
 *             x' * y.
 *
 * PARAMETERS: The vectors x and y should be arrays of type double, each
 *             having "size" elements.
 *
 * INCLUDE:    "ops.h"
 *
 * CALLS:      N/A
 *
 * CALLED BY:  matrix_product() [see below]
 *
 * RETURNS:    A double (scalar) value equal to the dot product x' * y.
 *
 * EXAMPLE:    The following example would conclude with answer == 10.0.
 *
 *             double        answer;
 *             static double x[] = { 1.0, 2.0, 3.0 },
 *                           y[] = { 3.0, 2.0, 1.0 };
 *             int           size = 3;
 *
 *             answer = dot_product(x, y, size);
 *
 * --------------------------------------------------------------------
 */

#ifdef PROTOTYPE
double dot_product(double *x, double *y, int size);
#else
double dot_product(/* double *x, double *y, int size */);
#endif
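For readers who want the operation spelled out, the dot product x' * y is a single accumulation loop. The sketch below is an assumed implementation, not the thesis's dot_product(); it uses the hypothetical name `dot()` and `const` qualifiers that the original K&R-compatible prototype does not have.

```c
#include <assert.h>

/* Sum of x[i] * y[i] over i = 0 .. size-1. */
double dot(const double *x, const double *y, int size)
{
    double sum = 0.0;
    int i;

    assert(x != 0 && y != 0 && size >= 0);
    for (i = 0; i < size; i++)
        sum += x[i] * y[i];
    return sum;
}
```

With x = (1, 2, 3) and y = (3, 2, 1) the result is 1*3 + 2*2 + 3*1 = 10.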
/* -------------========== FUNCTION DECLARATION ==========-------------
 *
 * PURPOSE:    To multiply matrices A and B, placing the product in C.
 *
 * INCLUDE:    "ops.h"
 *
 * CALLS:      dot_product() [see above]
 *
 * CALLED BY:
 *
 * PARAMETERS: The parameters tell the size of the matrix.
 *
 * RETURNS:    SUCCESS if the matrices were compatible for multiplication
 *             and C contained enough space to contain the entire result.
 *             FAILURE if A and B were incompatible or C was not big
 *             enough to hold the product.  The values for SUCCESS and
 *             FAILURE are given in "matrix.h".
 *
 * EXAMPLE:    Double_Matrix_Type *A,
 *                                *B,
 *                                *C;
 *
 *             if (matrix_product(A,B,C) == FAILURE) {
 *
 *                printf("matrix_product(A,B,C) failed.\n");
 *                exit(EXIT_FAILURE);
 *             }
 *             else {
 *
 *                printf("C contains A * B.\n");
 *             }
 *
 * --------------------------------------------------------------------
 */

#ifdef PROTOTYPE
int matrix_product(Double_Matrix_Type *A,
                   Double_Matrix_Type *B,
                   Double_Matrix_Type *C);
#else
int matrix_product();
#endif
/* -------------========== FUNCTION DECLARATION ==========-------------
 *
 * PURPOSE:    To search the elements below and to the right of A(k,k) for
 *             the element that is maximum in absolute value.
 *
 * INCLUDE:    <math.h>    [link using -lm if necessary]
 *             "ops.h"
 *
 * CALLS:      fabs()
 *
 * CALLED BY:
 *
 * PARAMETERS: A is the matrix (structure).  k is the index for a position
 *             on the main diagonal, A(k,k).  The search will be conducted
 *             for the area of the matrix that lies below k and to its
 *             right:
 *
 *                  (k,k) ------------------------------------------>
 *                    |    This is the area that will be searched
 *                    |    for an element of maximum absolute value.
 *                    |    The search does NOT include row k nor
 *                    |    does it include column k.
 *                    v
 *
 *             Parameters must also include s, the address of an integer
 *             that will contain the row number for the maximum element
 *             upon return; and t, an address of an integer to store the
 *             column number for the maximum element.
 *
 * NOTE:       To search the WHOLE MATRIX, the parameter k should be (-1).
 *             The values of k, s, and t should be interpreted as the C
 *             versions of indexes (i.e., beginning with 0).
 *
 * RETURNS:    The function returns the maximum (in absolute value)
 *             element found in A (type double).  Additionally, the index
 *             values for this element are placed in the variables pointed
 *             to by s (row) and t (col).
 *
 * EXAMPLE:    Double_Matrix_Type *A;
 *             double              u;
 *             int                 k,
 *                                 s,
 *                                 t;
 *
 *             u = max_element(A, k, &s, &t);
 *
 * --------------------------------------------------------------------
 */

#ifdef PROTOTYPE
double max_element(Double_Matrix_Type *A, int k, int *s, int *t);
#else
double max_element();
#endif

PURPOSE: 


INCLUDE: 


CALLS: 


CALLED BY: 


PARAMETERS: 


RETURNS: 


EXAMPLE: 


static double 


double 


FUNCTION DECLARATION 


ee ee ee ee ee ee ee ee ee ee ee ee ee ee a ee ee es es es ee es es ee ee ee es ee ee ee ee 
ae ee ee ee we ee ee oe ee ee ee we we we we we es = 2 ee 2 ee ee ew we we oe ee ee = = = 


ant ko intees,. int +t): 


me se ee ie es 
—_—eamam = ae a= a= a 


Computes the p-norm of the input vector x defined in [1] 


(page 53). 


<math.h> 
"ops ; Hh 


fabs () 


xX is the vector. 


double. The p argument is the p of p-norm. 


It must contain "size" elements of type 


A double (scalar) value equal to the p-norm of x. 


Euclidean_norm_of_x; 


Cot 


x 0 3= 171 0,- 230... 3.0) } ; 


251 * Euclidean_norm_of_x = normp(x, 2, 3); 
252. * 


256 #ifdef PROTOTYPE 

258 double normp(double *x, int p, int size); 
260 #else 

262 double normp(); 


264 #endif 


/* -------------========= FUNCTION DECLARATION =========-------------
 *
 * PURPOSE:    To place the outer product of x and y in C.
 *
 * INCLUDE:    "ops.h"
 *
 * CALLS:      N/A
 *
 * CALLED BY:  N/A
 *
 * ASSUMPTION: The matrix associated with C is already allocated to the
 *             proper size.
 *
 * PARAMETERS: Two vectors, x and y, of sizes x_size and y_size; and the
 *             matrix associated with C to accept the outer product.
 *
 * RETURNS:    The matrix associated with C is filled with the proper
 *             values.
 *
 * ----------------------------------------------------------------------
 */

#ifdef PROTOTYPE

void outer_product(double *x, int x_size, double *y, int y_size,
                   double **C);

#else

void outer_product();

#endif
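A minimal sketch of outer_product() follows. It assumes that C is an array of x_size row pointers, each addressing y_size doubles that the caller has already allocated, as the ASSUMPTION field requires; the thesis body is not shown, so the loop order is a guess.

```c
/* Sketch only: C[i][j] = x[i] * y[j].  C is assumed to be an array of
 * x_size row pointers, each pointing at y_size preallocated doubles. */
void outer_product(double *x, int x_size, double *y, int y_size,
                   double **C)
{
    int i, j;

    for (i = 0; i < x_size; i++)
        for (j = 0; j < y_size; j++)
            C[i][j] = x[i] * y[j];
}
```

The result has x_size rows and y_size columns, so the two vectors need not have the same length.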


/* -------------========= FUNCTION DECLARATION =========-------------
 *
 * PURPOSE:    To return the number of rows in the matrix A.
 *
 * INCLUDE:    "ops.h"
 *
 * ----------------------------------------------------------------------
 */

#ifdef PROTOTYPE

int rows(Double_Matrix_Type *A);

#else

int rows();

#endif


/* -------------========= FUNCTION DECLARATION =========-------------
 *
 * PURPOSE:    To swap columns p and q in the matrix contained within A.
 *
 * INCLUDE:    "ops.h"
 *
 * CALLS:      N/A
 *
 * CALLED BY:
 *
 * PARAMETERS: A is the structure holding the matrix.  The integers p and
 *             q are the column numbers to be swapped.  Indexes are
 *             numbered according to the C convention (beginning at zero).
 *
 * RETURNS:    Upon return, the columns have been swapped in A.
 *
 * ----------------------------------------------------------------------
 */

#ifdef PROTOTYPE

void swap_cols(Double_Matrix_Type *A, int p, int q);

#else

void swap_cols();

#endif


/* -------------========= FUNCTION DECLARATION =========-------------
 *
 * PURPOSE:    To swap rows p and q in the matrix contained within A.
 *
 * INCLUDE:    "ops.h"
 *
 * CALLS:      N/A
 *
 * CALLED BY:
 *
 * PARAMETERS: A is the structure holding the matrix.  The integers p and
 *             q are the row numbers to be swapped.  Indexes are numbered
 *             according to the C convention (beginning at zero).
 *
 * RETURNS:    Upon return, the rows have been swapped in A.
 *
 * ----------------------------------------------------------------------
 */

#ifdef PROTOTYPE

void swap_rows(Double_Matrix_Type *A, int p, int q);

#else

void swap_rows();

#endif


/* -------------========= FUNCTION DECLARATION =========-------------
 *
 * PURPOSE:    To initialize the vector v of n integers with the values
 *             { 0, 1, 2, ..., (n-1) }.
 *
 * INCLUDE:    "ops.h"
 *
 * CALLS:
 *
 * CALLED BY:
 *
 * ASSUMPTION: The vector, v, has already been successfully allocated as
 *             an array of n integers.
 *
 * PARAMETERS: The vector, v, to be initialized; and its size, n.
 *
 * RETURNS:    The vector's elements are set to the new values and these
 *             values are in v[] upon return.
 *
 * ----------------------------------------------------------------------
 */

#ifdef PROTOTYPE

void vec_init(int *v, int n);

#else

void vec_init();

#endif


/* -------------========= PROGRAM INFORMATION =========-------------
 *
 * SOURCE  : timing.h

4 ™* VERSION : i 

 * DATE    : 09 September 1991
 * AUTHOR  : Jonathan E. Hartman, U. S. Naval Postgraduate School
 *
 * -------------============= REFERENCES ==============-------------
 *
 * [1] Inmos.  The Transputer Databook, Second Edition, 1989.
 *
 * [2] Intel.  iPSC/2 Programmer's Reference Manual.
 *
 * -------------============= DESCRIPTION =============-------------
 *
 * This file contains definitions of manifest constants, type definitions,
 * and function declarations for time-related tasks on the Intel iPSC/2 or
 * a network of Inmos transputers.
 *
 * -------------========== LIST OF FUNCTIONS ==========-------------
 *
 * clock()
 * delay()
 *
 * ----------------------------------------------------------------------
 */


/* -------------========= MANIFEST CONSTANTS ==========------------- */

#ifdef TRANSPUTER

#define LO_PERIOD  64.0e-6    /* period of low priority clock      */
#define HI_PERIOD  1.0e-6     /* period of high priority clock     */
#define LO_FREQ    15625.0    /* frequency of low priority clock   */
#define HI_FREQ    1.0e6      /* frequency of high priority clock  */

#else /* iPSC/2 */

#define M_PERIOD   1.0e-3     /* period of Intel's mclock()        */
#define M_FREQ     1.0e3      /* frequency for Intel's mclock()    */

#endif
/* -------------========== TYPE DEFINITIONS ===========-------------
 *
 * The type 'ticks' is defined in an effort to make timing a bit more
 * transparent across the machines listed.
 *
 */

#ifdef TRANSPUTER

typedef int ticks;

#else /* iPSC/2 */

typedef unsigned long ticks;

#endif


/* -------------========= FUNCTION DECLARATION =========-------------
 *
 * PURPOSE:    To get the time (in ticks) from the processor's clock.
 *
 * INCLUDE:    <conc.h>     (Logical Systems C, version 89.1)
 *             "timing.h"
 *
 * CALLS:      Time()       (Logical Systems C, version 89.1)
 *             mclock()     (Intel iPSC/2 C)
 *
 * CALLED BY:
 *
 * PARAMETERS: None.
 *
 * RETURNS:    The function samples the clock and returns ticks.  More
 *             information on ticks, period, and frequency is given in the
 *             definitions above.
 *
 * EXAMPLE:    ticks t[2];
 *
 *             t[0] = clock();
 *
 * ----------------------------------------------------------------------
 */

#ifdef PROTOTYPE

ticks clock(void);

#else

ticks clock(/* void */);

#endif
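Two samples from clock() become an elapsed time once their difference is multiplied by the clock period. The sketch below assumes the iPSC/2 branch of timing.h (an unsigned long tick with a period of one millisecond); elapsed_seconds() is a hypothetical helper that does not appear in the thesis code.

```c
/* Assumed to mirror the iPSC/2 branch of timing.h. */
#define M_PERIOD 1.0e-3            /* seconds per mclock() tick */
typedef unsigned long ticks;

/* Hypothetical helper (not part of timing.h): convert two clock()
 * samples into elapsed seconds.  Because ticks is unsigned here, the
 * subtraction stays correct across one wraparound of the counter. */
double elapsed_seconds(ticks start, ticks stop)
{
    return (double) (stop - start) * M_PERIOD;
}
```

On the transputer branch the same idea would apply with LO_PERIOD or HI_PERIOD, depending on the priority of the calling process, but the signed int tick would need separate care at wraparound.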

/* -------------========= FUNCTION DECLARATION =========-------------
 *
 * PURPOSE:    To force a delay of at least a given amount (in seconds) in
 *             program execution.
 *
 * INCLUDE:    <conc.h>           (Logical Systems C, version 89.1)
 *             "timing.h"
 *
 * CALLS:      ProcGetPriority()  (Logical Systems C, version 89.1)
 *             Time()             (Logical Systems C, version 89.1)
 *             mclock()           (Intel iPSC/2 C)
 *
 * CALLED BY:
 *
 * PARAMETERS: The (float) argument tells the function the minimum time
 *             (in seconds) to delay.
 *
 * EXAMPLE:    delay(1.25);
 *
 * ----------------------------------------------------------------------
 */

#ifdef PROTOTYPE

void delay(float seconds);

#else

void delay( /* float seconds */ );

#endif
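The body of delay() is not reproduced in this listing. As a portable stand-in, the following sketch busy-waits on the standard C clock() from <time.h> rather than on Time() or mclock(); the name delay_sketch() is hypothetical, and the standard clock() measures processor time, which is an assumption that suits a spinning loop but differs from the machine clocks used in the thesis.

```c
#include <time.h>

/* Hypothetical portable stand-in for delay(): spin until at least
 * `seconds` of processor time have passed, as measured by the
 * standard C clock() and CLOCKS_PER_SEC. */
void delay_sketch(float seconds)
{
    clock_t start = clock();
    /* Round the span up by one tick so the wait is never short. */
    clock_t span  = (clock_t) ((double) seconds * CLOCKS_PER_SEC) + 1;

    while ((clock_t) (clock() - start) < span)
        ;   /* spin */
}
```

A call such as delay_sketch(1.25f) would play the role of the delay(1.25) example in the header.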


E. GAUSS FACTORIZATION CODE 


The Gauss factorization code appears on the pages that follow. First, the code
for partial pivoting is given. Since the complete pivoting case was very similar, most
of it has been omitted to save space. The pivot election function, however, is shown
in a fragment of gfpcnode.c, the node code for GF with Pivoting (Complete).
# ----------------------------------------------------------------------
#
# PURPOSE : Makefile for Hypercube Gauss Factorization (GF) Program
# AUTHOR  : Jonathan E. Hartman, U. S. Naval Postgraduate School
# DATE    : 26 August 1991
#
# ----------------------------------------------------------------------


ROOTCODE=gfpphost 
NODECODE=gfppnode
HEADER=gf 
NIF_FILE=gfpp 


# ---------------------- OPTIONS AND DEFINITIONS ---------------------- 


#  iPSC/2 Section (MDIR == MatLib directory) 


MDIR=/usr/hartman/matlib/ 


#  Transputer Section
#
#  The following section establishes options and definitions, starting
#  with PP, the Logical Systems C Preprocessor.  The '-dX' option (with no
#  macro_expression) is like '#define X 1'.  Next the compilation options
#  for Logical Systems' TCX Transputer C Compiler are given.  The '-c'
#  means compress the output file.  The options beginning with '-p' tell
#  TCX to generate code for the appropriate processor:
#
#      -p2     T212 or T222
#      -p25    T225
#      -p4     T414
#      -p45    T400 or T425
#      -p8     T800
#      -p85    T801 or T805
#
#  Logical Systems' TASM Transputer Assembler is next.  The '-c' means
#  compress the output file (it can cut it in half)!  The '-t' is used
#  because the input to TASM will be from a language translator (TCX's
#  output) and not from assembly source code.
#
#  The final list tells TLNK which libraries to look at during linking.
#  It also establishes an entry point.  We use '_main' for the root node
#  and '_ns_main' for other nodes.
#


PPOPT2=-dPROTOTYPE -dTRANSPUTER -dT212
PPOPT4=-dPROTOTYPE -dTRANSPUTER -dT414
PPOPT8=-dPROTOTYPE -dTRANSPUTER -dT800
TCXOPT2=-cp2
TCXOPT4=-cp4
TCXOPT8=-cp8
TASMOPT=-ct
T2LIB=t2lib.tll
T4LIB=matlib4.tll t4lib.tll
T8LIB=matlib8.tll t8lib.tll
RENTRY=_main
NENTRY=_ns_main

# ---------------------- DEFAULT ===> MAKE ALL -------------------------
#
#  Comment out one or the other....
#
# all:    ipsc
# run:    irun
# clean:  iclean
all:    transputer
run:    trun
clean:  tclean


#  iPSC/2 Section

ipsc: $(ROOTCODE) $(NODECODE)

$(ROOTCODE): $(ROOTCODE).o
	cc $(ROOTCODE).o $(MDIR)allocate.o $(MDIR)clargs.o $(MDIR)commhost.o $(MDIR)generate.o \
	   $(MDIR)epsilon.o $(MDIR)io.o $(MDIR)mathx.o $(MDIR)ops.o $(MDIR)timing.o -lm -host \
	   -o $(ROOTCODE)

$(ROOTCODE).o: $(ROOTCODE).c $(HEADER).h

#  Transputer Section

transputer: $(ROOTCODE).tld $(NODECODE).tld

$(ROOTCODE).tld: $(ROOTCODE).trl
	echo FLAG c > $(ROOTCODE).lnk
	echo LIST $(ROOTCODE).map >> $(ROOTCODE).lnk
	echo INPUT $(ROOTCODE).trl >> $(ROOTCODE).lnk
	echo ENTRY $(RENTRY) >> $(ROOTCODE).lnk
	echo LIBRARY $(T4LIB) >> $(ROOTCODE).lnk
	tlnk $(ROOTCODE).lnk

$(ROOTCODE).trl: $(ROOTCODE).tal
	tasm $(ROOTCODE).tal $(TASMOPT)

$(ROOTCODE).tal: $(ROOTCODE).pp
	tcx $(ROOTCODE).pp $(TCXOPT4)

$(ROOTCODE).pp: $(ROOTCODE).c
	pp $(ROOTCODE).c $(PPOPT4)


# ----------------------------- NODE CODE ------------------------------
#

#  iPSC/2 Section

$(NODECODE): $(NODECODE).o
	cc $(NODECODE).o $(MDIR)allocate.o $(MDIR)commnode.o $(MDIR)generate.o $(MDIR)io.o \
	   $(MDIR)mathx.o $(MDIR)ops.o $(MDIR)timing.o -node -lm -o $(NODECODE)

$(NODECODE).o: $(NODECODE).c $(HEADER).h

#  Transputer Section

$(NODECODE).tld: $(NODECODE).trl
	echo FLAG c > $(NODECODE).lnk
	echo LIST $(NODECODE).map >> $(NODECODE).lnk
	echo INPUT $(NODECODE).trl >> $(NODECODE).lnk
	echo ENTRY $(NENTRY) >> $(NODECODE).lnk
	echo LIBRARY $(T8LIB) >> $(NODECODE).lnk
	tlnk $(NODECODE).lnk

$(NODECODE).trl: $(NODECODE).tal
	tasm $(NODECODE).tal $(TASMOPT)

$(NODECODE).tal: $(NODECODE).pp
	tcx $(NODECODE).pp $(TCXOPT8)

$(NODECODE).pp: $(NODECODE).c
	pp $(NODECODE).c $(PPOPT8)

irun: $(ROOTCODE) $(NODECODE)
	$(ROOTCODE)

trun: $(ROOTCODE).tld $(NODECODE).tld $(NIF_FILE).nif
	echo makecube first
	ld-net $(NIF_FILE) -t -v

# ------------------------------ CLEAN UP ------------------------------
#

iclean:
	rm $(NODECODE).o
	rm $(ROOTCODE).o
	rm $(NODECODE)
	rm $(ROOTCODE)

tclean:
	del $(ROOTCODE).lnk
	del $(NODECODE).lnk
	del $(ROOTCODE).map
	del $(NODECODE).map
	del $(ROOTCODE).tal
	del $(NODECODE).tal
	del $(ROOTCODE).pp
	del $(NODECODE).pp
	del $(ROOTCODE).trl
	del $(NODECODE).trl

# ----------------------------------------------------------------------
gfpp.nif

; SOURCE  : gfpp.nif

VERSION : ie: 

; DATE    : 14 September 1991
; AUTHOR  : Jonathan E. Hartman, U. S. Naval Postgraduate School
; USAGE   : ld-net gfpp
;
; -------------============= REFERENCES ==============-------------
;
; [1] Inmos.  IMS B012 User Guide and Reference Manual.  Inmos Limited,
;     1988, Fig. 26, p. 28.
;
; -------------============= DESCRIPTION =============-------------
;
; Network Information File (NIF) used by Logical Systems C (version 89.1)
; LD-NET Network Loader.  This file prescribes the loading action to take
; place when the 'ld-net' command is given as in USAGE above.
;
; -------------======= HARDWARE PREREQUISITES ========-------------
;
; NOTE: There are three node numbering systems: the one created by Inmos'
; CHECK program, the Gray code labeling, and the NIF labeling.  Since all
; three will be used on occasion, I will prefix node numbers with a C, G,
; or N to identify which system I am using!
;
; The IMS B004 and IMS B012 must be configured correctly.  The B004's T414
; has link 0 connected to the host PC via a serial-to-parallel converter,
; link 1 connected to the IMS B012 PipeHead, link 2 connected to the T212
; [communications manager (not used here)] on the B012, and link 3
; connected to the IMS B012 PipeTail (see [1]).  By the way, link 2 from
; the B004 goes to the ConfigUp slot just under the PipeHead slot
; (this connects it to the T212).  Finally, the B004's Down link must run
; to the B012's Up link.
;
; -------------=== SETTING THE C004 CROSSBAR SWITCHES ===-------------
;
; Once you have connected the hardware in the fashion mentioned above,
; the system is ready to be transformed to a hypercube.  Three codes by
; Mike Esposito are used here: t2.nif, root.tld, and switch.tld.  I have
; a batch file called 'makecube.bat' that performs a 'ld-net t2' also.
;
; Mike's code passes instructions to the T212 on the B012, which, in turn,
; tells the C004's how to connect their switches.  After the code has
; executed, the (very specific) configuration that we are looking for
; will exist.  Specifically, the following (output from CHECK /R) is what
; this process gives us:
;
; check 1.21
;  # Part     rate Mb Bt [  Link0  Link1  Link2  Link3 ]
;  0 T414b-15 0.09     0 [   HOST    1:1    2:1    3:2 ]
;  1 T800c-20 0.80     1 [    4:3    0:1    5:1    6:0 ]
;  2 T2   -17 0.49     1 [   C004    0:2    ...   C004 ]
;  3 T800c-20 0.80     2 [    7:3    8:2    0:3    9:0 ]
;  4 T800c-20 0.76     3 [    9:3   10:2   11:1    1:0 ]
;  5 T800d-20 0.90     1 [    8:3    1:2   10:1   12:0 ]
;  6 T800d-20 0.76     0 [    1:3   12:2    7:1   11:0 ]
;  7 T800d-20 0.76     3 [   13:3    6:2   14:1    3:0 ]
;  8 T800d-20 0.90     2 [   14:3   15:2    3:1    5:0 ]
;  9 T800c-20 0.77     0 [    3:3   13:2   15:1    4:0 ]
; 10 T800d-20 0.90     2 [   16:3    5:2    4:1   15:0 ]
; 11 T800d-20 0.90     1 [    6:3    4:2   16:1   13:0 ]
; 12 T800d-20 0.77     0 [    5:3   16:2    6:1   14:0 ]
; 13 T800d-20 0.77     3 [   11:3   17:2    9:1    7:0 ]
; 14 T800c-20 0.90     1 [   12:3    7:2   17:1    8:0 ]
; 15 T800c-20 0.90     2 [   10:3    9:2    8:1   17:0 ]
; 16 T800c-20 0.76     3 [   17:3   11:2   12:1   10:0 ]
; 17 T800d-20 0.88     2 [   15:3   14:2   13:1   16:0 ]
;
; Here node C0 is the root transputer (on the IMS B004) and node C2 is
; the T212 (on the IMS B012).  The other sixteen nodes are the T800's
; that are used for the work.  A logical interconnection topology is
; described below.
;

; -------------=============== TOPOLOGY ==============-------------
;
; The physical interconnection scheme described above is an actual 4-cube
; with one exception.  The root node (C0) is situated BETWEEN nodes C1
; and C3 (which would be connected directly in the usual 4-cube).  This
; gives us two 3-cubes: one whose node labeling is G0xxx and the other,
; whose node labeling is G1xxx (where the xxx represents all 3-bit
; patterns).  These are the usual 3-cubes, and they will exist if we
; define the node numbering/labeling correctly.
;
; -------------=============== STRATEGY ==============-------------
;
; The node labeling established by the NIF is available via the variable
; _node_number (see <conc.h>) in source code.  Therefore, we would like a
; smart labeling scheme in the NIF file so that programming is easier.
; This, of course, is subject to the restriction that NIF labels begin
; with N1 and so on.
;
; One such method would be to define a NIF labeling so that the Gray code
; label for a node would be (_node_number - 2).  In fact, this is
; possible and the adjacencies defined below allow us to realize this
; feature.  Below, node N0 is the host PC, node N1 is the root transputer
; (T414 on the B004), N2 through N17 correspond to G0 through G15 (the
; nodes of a 4-cube), and N18 is not used (but it's the T212).

105 ; 

106 > lee lene nee eel a ee nn an aon aan an pa prepa n—p——b—— ———— 
107 

108 

109 host_server cio.exe; (default) 

110 

1G Up ae TRANSPUTER RESET DESCRIPTION OF LIWK CONNECTIONS 

112 ; WODE LOADABLE COMES 0 _s = = == --* -- -- - -- 8 
1132: ID CODE (.tld) FROM: LIWKO LIWK1 LIWK2 LIWK3 

114 ; = SS SS SS SS SS SSe5 SS Ss a pt o_o SSeS 

115 AV gfpphost, rO; 0, 2, ? 10; BOO4 
116 2a gippnode, Ei 4, tA; a 6; Bo12 
117 on gippnode, Tr 20 ates 2; 5, te 

118 4, gippnode, ro. 12, §, 8, Pas 

119 §, gippnode, rs, 9, Se 4, 1:3; 

120 6, gippnode, ria Pa Te 14, 8; 

12% To gippnode, 19: <p 9, 6, Tee 

122 8, gippnode, r4, 6, 4, 9, 16; 

123 9: gippnode, rs, tie 8, Bs 

124 10’, gfppnode, rid; 14, yO 17, pe 

125 i). gfppnode, 1139 TS. 135 10, 3 

126 123 gfppnode, ri6, 10, 16, 13% 4; 

127 13r gippnode, ri2, §, soa 115 147 

128 14, gippnode, r6, 16, 6, 1S; 10; 

129 ls - gippnode, r14, io 14, 17. ps 

130 16, gfppnode, A i 8, ily 12 14; 

131 de gfppnode, 15s 134 15. sy. 9; 

132055 18; switch, chal : iy : : TZ 
133 

134 

1385 50 toot oor rrr nn Se esssssssss=a EOF gfpp.nif sssssssssssssss------------- 
/* -------------========= PROGRAM INFORMATION =========-------------
 *
 * SOURCE  : gf.h

4 * VERSION : Pa: 

 * DATE    : 21 September 1991
 * AUTHOR  : Jonathan E. Hartman, U. S. Naval Postgraduate School
 *
 * SEE ALSO: gfpc.mak     makefile for the complete pivoting case
 *           gfpp.mak     makefile for the partial pivoting case
 *           gfpchost.c   host code for the complete pivoting case
 *           gfpphost.c   host code for the partial pivoting case
 *           gfpcnode.c   node code for the complete pivoting case
 *           gfppnode.c   node code for the partial pivoting case
 *
 * -------------============= REFERENCES ==============-------------
 *
 * [1] Gragg, William B.  MATLAB code and personal conversations, 1991.
 *
 * -------------============= DESCRIPTION =============-------------
 *
 * This header file is shared by several programs (listed above).  Each of
 * these codes has something to do with a parallel implementation of Gauss
 * Factorization (GF).  Several pivoting strategies are supported.  Files
 * like gfpc*.* represent a COMPLETE pivoting strategy, and the files like
 * gfpp*.* give the corresponding code for the PARTIAL pivoting scheme.
 *
 * The basic algorithm is from [1].  Parallelism is sought by distributing
 * the columns of A across the nodes of a multiprocessor system (using the
 * hypercube interconnection topology).  The program is designed for the
 * Intel iPSC/2 or a network of Inmos transputers.
 *
 * The algorithm factors Q'AP = LU with P and Q permutation matrices, L
 * unit lower trapezoidal (r columns) and U upper trapezoidal with nonzero
 * diagonal elements (r rows).  The program is designed for a general
 * matrix, A.  It does not assume A square or sparse.  There is no effort
 * to optimize for this, or any other, special structure.  There is one
 * caveat: I designed the code to gather data for square matrices of full
 * rank.  Therefore, I have tested the square case of random matrices very
 * carefully.  While the code should work for any general matrix, it has
 * not been carefully tested in other cases.  Additionally, since I sought
 * timing data for matrices of full rank, I have NOT addressed the problem
 * of gathering columns (back to the host) to the right of the final pivot
 * for rank-deficient matrices.  This would not be a difficult task, but I
 * did not make this effort since it has no bearing on my goal.
 *
 * In the partial pivoting code, the search for pivots is carried out only
 * in the pivot column, so P is the identity (i.e., there are no column
 * interchanges).  Many of the remaining comments pertain to the complete
 * pivoting case, since it is the most challenging.  The changes for the
 * partial pivoting case should be evident in most cases.  At times, when
 * the changes are not necessarily evident, clarifying remarks address the
 * partial pivoting scheme.  This header file contains the majority of the
 * background and algorithm information, but if you're after a careful
 * study of the differences, compare the source codes.  The algorithm below
 * gives a road map through the code.
 *
 * ----------------------------------------------------------------------
 */


/* -------------======== ALGORITHM: BACKGROUND =========-------------
 *
 * 1.) Preliminaries.  Consider A (m x n), a matrix of real numbers.  The
 * permutation vectors, p and q, characterize column and row permutations
 * (respectively).  The scalar, (g/a), is the growth factor.  The integer,
 * r, is a fairly reasonable determination of the 'numerical rank' of A.
 * The C language convention is followed, numbering rows and columns from
 * zero; and storing dynamic, two-dimensional arrays (matrices) in
 * row-major order.  The 'pivot' will be that element located at A(k,k).
 * The area (in A) below and to the right of the pivot [all A(i,j) where
 * i > k and j > k] is called the 'Gauss transform area'.
 *
 * 2.) Communications and Coordination.  Let N be the number of processors
 * (workers) in the hypercube.  These nodes are labeled with a Gray code
 * { 0 .. (N - 1) }.  The root (host) node distributes the columns of A to
 * the nodes.  This is done cyclically, using the C modulus operator (%).
 * That is, column j will be sent to processor (j mod N).  Once the nodes
 * have their columns, they begin work.  Communication (for the complete
 * pivoting case) involves an election process for the next pivot, where
 * each of the nodes finds its best candidate and then the election finds
 * the best candidate in the global picture.  This is done in lg(N) steps
 * using the cubecast_from() function.
 *
 * The partial pivoting case does not require the election process that
 * complete pivoting needs, but both methods look similar (in terms of
 * communication) after the elections are complete.  The node holding the
 * pivot column must perform the pivot column arithmetic and distribute
 * the resulting pivot column (also in lg(N) steps) to the other nodes.
 * Communications functions are not explained much in this code, but
 * details can be found in the files comm.h & comm.c.
 *
 * 3.) Pivoting Strategy.  The complete pivoting strategy's election
 * process (at each stage) determines the element in (the entire Gauss
 * transform area of) A that is largest in absolute value.  This element
 * wins the election and is 'moved' to A(k,k) for the upcoming stage.  It
 * isn't really moved...but p and q are updated so that we can keep track
 * of permutations.  During the search for the new pivot, candidates are
 * denoted A(s,t) = u.  The largest of the candidates is installed as the
 * next pivot.  There seems to be too much overhead associated with this
 * fancy indexing off of p[] and q[].  For the partial pivoting code, I
 * chose to ACTUALLY SWAP rows (if necessary) at each stage.  This makes
 * the 'pp' code a bit easier to read.
 *
 * 4.) Stopping.  The GF process is repeated until one of two criteria is
 * satisfied.  First, of course, we may run out of matrix.  Secondly, we
 * may find a pivot whose absolute value is less than our tolerance (tol).
 * In the latter case, we have a rank-deficient A.  Currently, the codes
 * recognize rank-deficiency and bail out of the iteration loop; but they
 * do not gather (to the host) all of the remaining columns to the right
 * of the last pivot.  This is discussed above.
 *


 * -------------====== ALGORITHM: THE GF PROCESS =======-------------
 *
 * 0.) Initialization.  Let dim be the dimension of the hypercube.  Let
 * k = 0.  Search A and find the largest (in absolute value) element, u.
 * This is done at each node.  Once each node has a local candidate for
 * the next pivot, an election is held, dimension-by-dimension.  This
 * requires (dim) steps, and when it is finished, every processor knows
 * exactly the position and value of the next pivot.  Exception: In the
 * partial pivoting code, the processor which has the pivot column simply
 * searches the (proper part of the) pivot column for the next pivot and
 * then informs the other processors.
 *
 * 1.) Status.  Every node knows the position and value of the next pivot,
 * namely u = A(s,t); and where it should be installed, A(k,k).  The growth
 * rate is adjusted: g = max[g, abs(u)].  If (abs(u) < tol), then A is
 * rank-deficient and we exit the loop (using the C 'break' statement).
 *
 * 2.) Permutations.  We account for the interchange of rows s and k and
 * columns t and k by swapping the elements of p[] that are indexed by k
 * and t and swapping the elements in q[] indexed by k and s.  This
 * (effectively) establishes the new pivot at A(k,k).  The column
 * permutation vector has no significance in the partial pivoting case
 * since it would never be changed.  The matrix, P, in this case, is
 * simply the identity.
 *
 * 3.) Adjust the Gauss Transform Area.
 *
 *     (a) In the (single) node that holds the new pivot's column (k),
 *     divide every element below the pivot by the pivot value.  Broadcast
 *     this column to every other node.  Node 0 updates the manager, who
 *     uses this information to append to his copy of the resulting
 *     (factored) A.
 *
 *     (b) Now every worker has the updated column k.  At every node, do
 *     the following: For every element A(i,j) [ where i > k and j > k ]
 *     let A(i,j) = A(i,j) - A(i,k) * A(k,j).
 *
 * 4.) Pivot Search.  In the Gauss transform area, G, search for the
 * element that is largest in absolute value.  Its position is A(s,t) and
 * its value is u.  The candidates are chosen at the local (processor)
 * level, then an election is held at the global level to determine the
 * best candidate in the same manner that was described in step 0.
 * Increment k.  Repeat the process (go back to step 1).  The obvious
 * exceptions apply to the partial pivoting case.
 *


 * -------------======= NOTES FOR IMPROVEMENT ==========-------------
 *
 * Currently the code does not give full support for rank-deficiency.  It
 * DOES break out of the loop, but everything to the right of the final
 * pivot column will be garbage.  It would be relatively easy to add the
 * necessary post-iteration rank-deficiency check and coalesce each of the
 * remaining columns back to the manager, but this code was created to
 * test the full-rank cases and take performance data.
 *
 * Secondly, there is the issue of whether it is better for the manager to
 * receive each pivot column as it becomes available, or if all columns
 * should be sent in at the end.  I'm not yet sure which method is better,
 * but the current code keeps the root node up-to-date at each stage.  This
 * is probably the best solution to the problem above and would probably
 * enhance performance during the iterations!  It REALLY SHOULD BE TESTED!
 *
 * There are many other questions that pertain to optimization that remain
 * unanswered (especially in the complete pivoting case).
 *


 * -------------======== ALGORITHM: CONCLUSION =========-------------
 *
 * 1.) Rank.  Set r, the rank of A, equal to the number of iterations that
 * were executed.  This is automatic in the manager (host) code since
 * the integer, r, is used as the loop index.  The worker nodes use k for
 * a loop index variable.
 *
 * 2.) Interchanges.  Row and column interchanges are not actually done in
 * the complete pivoting code.  Instead, we maintain permutation vectors,
 * p[] and q[].  You may note that while both vectors are used heavily
 * during the GF process, q[], in particular, comes in handy at the end to
 * set A in order.  The partial pivoting code performs the actual
 * interchanges of rows.  At first, we would be inclined to believe that
 * the indexing by p[] and q[] leads to better performance, but there is
 * no clear timing evidence (at this point) that supports this idea.
 *
 * 3.) Factors.  The upper trapezoidal matrix, U, is the upper trapezoid
 * of (the resulting, factored) A (the diagonal of A and everything above
 * that).  The lower trapezoidal matrix, L, is formed by placing ones on
 * the diagonal of A; zeros above; and copying the lower trapezoid of A
 * (excluding the diagonal).  To form Q'AP, we use THE ORIGINAL copy of A
 * (not the factored, resulting A) and the matrices Q and P that are
 * implied by q[] and p[].  That is, in the end, we set Q[q[i]][i] = 1.0
 * for all i in { 0, 1, ..., (m-1) } and set P[p[j]][j] = 1.0 for all j
 * in { 0, 1, ..., (n-1) }.
 *
 * ----------------------------------------------------------------------
 */
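The cyclic column distribution and the dimension-by-dimension pivot election described in the background notes can be sketched in a single process. This is a simulation under stated assumptions, not the thesis code: owner_of_column(), Candidate, and elect_pivot() are hypothetical names, and an array holding one candidate per node stands in for the messages that cubecast_from() would exchange between neighbors.

```c
#include <math.h>

#define DIMENSION 3                  /* dimension of the simulated cube */
#define CUBESIZE  (1 << DIMENSION)   /* number of worker nodes, N       */

/* Hypothetical record for a pivot candidate u = A(s,t). */
typedef struct {
    double u;   /* value of the candidate        */
    int    s;   /* row index of the candidate    */
    int    t;   /* column index of the candidate */
} Candidate;

/* Cyclic distribution: column j is held by node (j mod N). */
int owner_of_column(int j)
{
    return j % CUBESIZE;
}

/* Simulated election.  In step d every node pairs with the neighbor
 * whose label differs in bit d, and both keep the candidate that is
 * larger in absolute value.  After DIMENSION steps every entry of c[]
 * holds the same global winner. */
void elect_pivot(Candidate c[CUBESIZE])
{
    int d, node, partner;

    for (d = 0; d < DIMENSION; d++) {
        for (node = 0; node < CUBESIZE; node++) {
            partner = node ^ (1 << d);
            if (node < partner) {             /* handle each pair once */
                if (fabs(c[partner].u) > fabs(c[node].u))
                    c[node] = c[partner];
                else
                    c[partner] = c[node];
            }
        }
    }
}
```

The pairing by a single differing bit is what keeps every exchange between directly connected nodes of the hypercube, which is why lg(N) steps suffice.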


221 /# -------------SSesssssss MANIFEST CONSTANTS SSSSSSSSS Sommerer 
222 * 

223 * 

224 * Section 1: Communications Aids (Message Types and Type Selectors) 

a0 «| * 

226 * The following manifest constants simplify the communications effort. 
227 * The TRANSPUTER section is fairly general in mature. The iPSC/2 section 
228 * specifies types and type selectors for csend() and crecv(). It IS 

229 * SIGNIFICANT that NODE_OFFSET is the largest of these. It must remain 
230 * the largest so that (for all nodes n) the value of (n + WODE_OFFSET) 
231 * cannot be equal to one of the other message types (consider n == 0). 
232 * 
ee 
234 */ 
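The NODE_OFFSET rule can be checked mechanically. The sketch below assumes the iPSC/2 message-type values 1 through 7 from the listing (the value 5 for PCOL_TYPE is inferred from the sequence); the helper names node_message_type() and collides_with_fixed_type() are hypothetical and are not part of gf.h.

```c
#include <assert.h>

/* Message types as listed for the iPSC/2 section (PCOL_TYPE assumed 5). */
#define ARG_TYPE       1
#define COL_SIZE_TYPE  2
#define COL_TYPE       3
#define PIVOT_TYPE     4
#define PCOL_TYPE      5
#define ROW_SIZE_TYPE  6
#define NODE_OFFSET    7

/* Type used for a message that originates at a given node. */
int node_message_type(int node)
{
    assert(node >= 0);
    return node + NODE_OFFSET;
}

/* True when a type falls in the range reserved for the fixed types. */
int collides_with_fixed_type(int type)
{
    return (type >= ARG_TYPE) && (type < NODE_OFFSET);
}
```

Because NODE_OFFSET is the largest constant, even node 0 maps to type 7, which lies outside the range 1..6 used by the fixed message types.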

#ifdef TRANSPUTER

#define CUBESIZE       8   /* change these for a cube of other dim    */
#define DIMENSION      3

#else /* iPSC/2 */

#define ARG_TYPE       1   /* for passing command line argument info  */
#define COL_SIZE_TYPE  2   /* for sending n part of size(A) ==> cols  */
#define COL_TYPE       3   /* use this to send a column               */
#define PIVOT_TYPE     4   /* candidate for next pivot                */
#define PCOL_TYPE      5   /* use this to send a pivot column         */
#define ROW_SIZE_TYPE  6   /* for sending m part of size(A) ==> rows  */
#define NODE_OFFSET    7   /* for sending messages from nodes         */


#endif

/*
 * Section 3: Timing
 *
 * The root uses a two-dimensional array where the rows are indexed by the
 * node numbers and the columns use the following indexing. The nodes, of
 * course, only need a one-dimensional array with indexing according to the
 * following scheme. There are a total of MAX_EVENTS elements in the
 * array, and indexing for a specific event is given by START_TIME, SETUP,
 * and so on. The partial pivoting case does not use all of the events.
 */

#define MAX_EVENTS      18  /* number of events that we want to time   */
#define DATA_SOURCE      0  /* node number of source of the data       */
#define START_TIME       1  /* t(0) ==> starting time for the node     */
#define SETUP            2  /* from t(0) until starting to receive cols */
#define DISTRIB_COLS     3  /* time to distribute columns              */
#define FIRST_PIVOT      4  /* from receipt of last col to start iter  */

/* The next two only apply to nodes zero and eight */
#define PCOLS_TO_HOST    5  /* time spent passing pivot cols to host   */
#define PIVOTS_TO_HOST   6  /* time spent passing pivots to host       */

/* The next five kind of represent the big picture */
#define PIVOT_ELECTION   7  /* time spent on pivot elections           */
#define UPDATING_PQ      8  /* time spent updating permutations p and q */
#define PCOL_ARITHMETIC  9  /* time spent on pivot column arithmetic   */
#define PCOL_DISTRIB    10  /* time spent distributing pivot columns   */
#define UPDATING_G      11  /* time spent updating the Gauss transform */

/* The next four are times from within update_G() */
#define PRLTIME         12  /* pivot row location time                 */
#define LCTIME          13  /* time to determine if a column is local  */
#define G_ARITHMETIC    14  /* time spent on arithmetic within G       */
#define LOOPTIME        15  /* time for both for() loops in update_G() */

/* The last two are back at the big picture level again */
#define ITERATION       16  /* time checked before and after iteration */
#define STOP            17  /* the last time sampled by the node       */
/*
 * Section 4: General
 */

#define AFT    4   /* number of digits to print after decimal   */
#define WIDTH  6   /* number of characters (including decimal)  */

/*
 * Section 5: A special flag used for the id field of a pivot. When it
 * appears, it indicates that the sending node's part of A has
 * no elements as big as the tolerance, tol; and therefore this node's
 * candidate for pivot should not be considered.
 */

#define RANK_DEFICIENT -1


/* ------------============== TYPE DEFINITIONS ==============------------ */

typedef struct {

    int     id;
    double  u;
    int     s,
            t;

} Pivot_Type;
346 


gfpphost.c


/* ------------============= PROGRAM INFORMATION =============------------
 *
 * SOURCE  : gfpphost.c
 * VERSION : 2.0
 * DATE    : 21 September 1991
 * AUTHOR  : Jonathan E. Hartman, U. S. Naval Postgraduate School
 *
 * ------------================= DESCRIPTION =================------------
 *
 * Gauss Factorization (GF) with Partial Pivoting: Parallel Version.
 * This is the manager portion of the code. See [gf.h] for details.
 *
 * ------------------------------------------------------------------------
 */
#include <stdio.h>
#include <string.h>

#ifdef TRANSPUTER

#include <conc.h>
#include <stdlib.h>    /* addfree(), _heapend */
#include <matrix.h>
#include <macros.h>
#include <allocate.h>
#include <clargs.h>
#include <comm.h>
#include <epsilon.h>
#include <generate.h>
#include <io.h>
#include <ops.h>
#include <timing.h>

#else /* iPSC/2 */

#include "/usr/hartman/matlib/matrix.h"
#include "/usr/hartman/matlib/macros.h"
#include "/usr/hartman/matlib/allocate.h"
#include "/usr/hartman/matlib/clargs.h"
#include "/usr/hartman/matlib/comm.h"
#include "/usr/hartman/matlib/epsilon.h"
#include "/usr/hartman/matlib/generate.h"
#include "/usr/hartman/matlib/io.h"
#include "/usr/hartman/matlib/ops.h"
#include "/usr/hartman/matlib/timing.h"

#endif

#include "gf.h"


/* ------------============= MANIFEST CONSTANTS =============------------
 *
 * The following manifest constants are used to determine the size of the
 * option list, optv[]; indexing associated with valid command line
 * arguments; and selection constants for the user's choice of matrix type
 * [used in generate()].
 *
 */

#define NUMBER_OF_ARGS    3   /* -d -t -v                  */
#define DIM               0   /* index into optv[]         */
#define TIMING            1   /*   "     "     "           */
#define VERBOSE           2   /*   "     "     "           */

#define SELECT_QUIT       0   /* menu / matrix selection   */
#define SELECT_IDENTITY   1
#define SELECT_HILBERT    2
#define SELECT_RANDOM     3
#define SELECT_WILKINSON  4


/* ------------=================== GLOBALS ===================------------ */

static char version[] = "Parallel GF with Partial Pivoting, Version 2.0";

#ifdef TRANSPUTER

Channel *ic[(CUBESIZE + 1)],
        *oc[(CUBESIZE + 1)];

#else /* iPSC/2 */

static char *cubename;
static char *nodecode = "gfppnode";

#endif /* TRANSPUTER */

static Arg_Struct *optv[NUMBER_OF_ARGS];

/* ------------============ FUNCTION DEFINITION ============------------
 *
 * The structure is defined more carefully in clargs.h, but the basic idea
 * is that we have an array of pointers to type Arg_Struct...in this case,
 * there are NUMBER_OF_ARGS valid arguments and the next few steps take
 * care of allocation and definition of them. The -d argument allows the
 * user to enter the desired dimension of the hypercube, -t sets timing on
 * and -v is used to set verbose on.
 */

void define_valid_args() {

    static int interpret[] = { LONG };

    install_complex_arg(DIM, optv, "-d", interpret...

    install_simple_arg(TIMING, optv, "-t");
    install_simple_arg(VERBOSE, optv, "-v");

}
/* End define_valid_args() ----------------------------------------------- */


/* ------------============ FUNCTION DEFINITION ============------------
 *
 * A simple function to display the results....
 */

#ifdef PROTOTYPE

void display_timing_data(Double_Matrix_Type *A,
                         int                 dim,
                         double              a,
                         double              eps,
                         double              g,
                         double              tol,
                         int                 r,
                         double            **t)

#else

void display_timing_data(A, dim, a, eps, g, tol, r, t)

Double_Matrix_Type  *A;
int                  dim;
double               a,
                     eps,
                     g,
                     tol;
int                  r;
double             **t;

#endif
{
    int  aft,
         cubesize = pow2(dim),
         i,
         m = A->rows,
         n = A->cols,
         width;


#ifdef TRANSPUTER  /* is measured in 64 microsecond ticks ==> 4-5 places */

    aft   = 5;
    width = 15;

#else              /* iPSC/2 is measured in milliseconds ==> three places */

    aft   = 3;
    width = 13;

#endif

    printf("--------------========= TIMING DATA =========------");
    printf("--------\n\n");

    printf(" Hypercube of order %d ", dim);
    (dim == 0) ? (printf("(1 processor) \n\n"))
               : (printf("(%d processors)\n\n", cubesize));

    printf("Problem size ==> size(A) = (%d x %d).\n", m, n);
    printf("Machine precision: eps = %e\n", eps);
    printf("Tolerance: tol = %e\n", tol);
    printf("Growth factor: g/a = %e\n", (g/a));
    printf("Rank: rank(A) = %3d\n", r);
    printf("Units for timing data: = seconds\n");

    for (i = 0; i < cubesize; i++) {

        printf("\n\nNode %2d Data ------------------------------", i);
        printf("--------\n\n");

        printf("Setup and initialization: ");
        printf("%*.*lf", width, aft, t[i][SETUP]);
        printf("\nInitial column distribution: ");
        printf("%*.*lf", width, aft, t[i][DISTRIB_COLS]);

        if (i == 0) {

            printf("\nTransmission of pivot columns to the host: ");
            printf("%*.*lf", width, aft, t[i][PCOLS_TO_HOST]);
            printf("\nTransmission of pivots to the host: ");
            printf("%*.*lf", width, aft, t[i][PIVOTS_TO_HOST]);
        }

        printf("\nPerformance of pivot column arithmetic: ");
        printf("%*.*lf", width, aft, t[i][PCOL_ARITHMETIC]);
        printf("\nDistribution of pivot columns: ");
        printf("%*.*lf", width, aft, t[i][PCOL_DISTRIB]);
        printf("\nPerformance of updates and arithmetic in G: ");
        printf("%*.*lf", width, aft, t[i][UPDATING_G]);
        printf("\nUpdate_G(): loop time including arithmetic: ");
        printf("%*.*lf", width, aft, t[i][LOOPTIME]);

        printf("\n\nTime for all work inside main iteration loop: ");
        printf("%*.*lf", width, aft, t[i][ITERATION]);
        printf("\nTotal time from start to stop: ");
        printf("%*.*lf\n\n", width, aft, (t[i][STOP] - t[i][START_TIME]));
    }

}
/* End display_timing_data() --------------------------------------------- */


/* ------------============ FUNCTION DEFINITION ============------------
 *
 * This function distributes the columns of A to the nodes of the
 * hypercube. The loop variable, j, designates each column of A in turn.
 * The column buffer, cbuf[], copies from A the column to be transmitted.
 * After cbuf[] is filled, [i = (j mod cubesize)] means that node i will
 * get column j and the modulus operation seems to be a reasonable and
 * efficient scheme of distribution. Finally, the call to send() ships
 * the column out to the appropriate node.
 *
 * ------------------------------------------------------------------------
 */

#ifdef PROTOTYPE

void distribute_columns(Double_Matrix_Type *A, int dim, double *cbuf)

#else

void distribute_columns(A, dim, cbuf)

Double_Matrix_Type  *A;
int                  dim;
double              *cbuf;

#endif
{

    int  i,
         j,
         pos = 42,                 /* position of print head        */
         rm  = LINE_LENGTH - 10;   /* right margin (see matrix.h)   */

    long cubesize   = pow2(dim),
         sizeof_col = (long) (A->rows * sizeof(double));


    printf("Distributing the columns of A to the nodes");

    for (j = 0; j < A->cols; j++) {

        for (i = 0; i < A->rows; i++) { cbuf[i] = A->matrix[i][j]; }

        i = j % cubesize;                  /* column --> node i */

#ifdef TRANSPUTER                          /* node 0 has to sort 'em out */

        if (i < 8) {

            send(0, (char *) cbuf, sizeof_col, cubesize);
        }
        else {

            send(8, (char *) cbuf, sizeof_col, cubesize);
        }

#else /* iPSC/2 */

        send(i, (char *) cbuf, sizeof_col, COL_TYPE);

#endif /* TRANSPUTER */

        printf(".");

        if (pos++ > rm) {

            pos = 0;
            printf("\n");
        }

    }

    printf("\nColumn distribution complete. \n\n");

}
/* End distribute_columns() ---------------------------------------------- */
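The modulus scheme used by distribute_columns() deals the columns out cyclically. A minimal sketch of the bookkeeping follows; the names owner_of_column() and local_column_count() are hypothetical helpers introduced for illustration and are not functions from the thesis library.

```c
#include <assert.h>

/* Node that receives column j under the (j mod cubesize) scheme. */
int owner_of_column(int j, int cubesize)
{
    assert(j >= 0 && cubesize > 0);
    return j % cubesize;
}

/* Number of columns a node holds after n columns have been dealt out
 * cyclically: every node gets n / cubesize columns, and the first
 * (n mod cubesize) nodes get one extra. */
int local_column_count(int node, int n, int cubesize)
{
    assert(cubesize > 0 && node >= 0 && node < cubesize && n >= 0);
    return n / cubesize + ((node < n % cubesize) ? 1 : 0);
}
```

For n = 10 columns on an 8-node cube, nodes 0 and 1 hold two columns each and the remaining six nodes hold one, so no node carries more than one column above any other.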


/* ------------============ FUNCTION DEFINITION ============------------
 *
 * This function prompts the user for matrix size and type, then generates
 * the matrix with a call to a function from generate.c.
 */

#ifdef PROTOTYPE

Double_Matrix_Type *generate(int *m, int *n)

#else

Double_Matrix_Type *generate(m, n)

int  *m,
     *n;

#endif
{
    Double_Matrix_Type  *A;

    int  matrix_type,
         valid = FALSE;


    printf("Please enter the number of rows in A: ");
    scanf("%d", m);
    fflush(stdin);

    printf("\n...and the number of columns in A: ");
    scanf("%d", n);
    fflush(stdin);

    printf("\n\nSelect from the following list of matrices:");

    while (!valid) {

        printf("\n\n");
        printf(" %d.) QUIT \n",      SELECT_QUIT);
        printf(" %d.) Identity \n",  SELECT_IDENTITY);
        printf(" %d.) Hilbert \n",   SELECT_HILBERT);
        printf(" %d.) Random \n",    SELECT_RANDOM);
        printf(" %d.) Wilkinson \n", SELECT_WILKINSON);
        printf("\n>");
        scanf("%d", &matrix_type);
        fflush(stdin);

        switch(matrix_type) {

            case SELECT_IDENTITY  :
            case SELECT_HILBERT   :
            case SELECT_RANDOM    :
            case SELECT_WILKINSON : valid = TRUE; break;

            case SELECT_QUIT      : exit(EXIT_SUCCESS);
        }

    } /* end while() */


    switch(matrix_type) {

        case SELECT_IDENTITY:

            printf("\n\nGenerating A = identity(%d, %d).\n\n", *m, *n);

            A = identity(*m, *n);
            break;

        case SELECT_HILBERT:

            printf("\n\nGenerating A = hilbert(%d, %d).\n\n", *m, *n);

            A = hilbert(*m, *n);
            break;

        case SELECT_RANDOM:

            printf("\n\nGenerating A = mxrand(%d, %d).\n\n", *m, *n);

            A = mxrand(*m, *n);
            break;

        case SELECT_WILKINSON:

            printf("\n\nGenerating A = wilkinson(%d, %d).\n\n", *m, *n);

            A = wilkinson(*m, *n);
            break;
    }

    if (!A) {

        printf("generate(): Allocation failure for the matrix A.\n");
        exit(EXIT_FAILURE);
    }

    return(A);
}
/* End generate() -------------------------------------------------------- */


/* ------------============ FUNCTION DEFINITION ============------------
 *
 * Collect timing data from the nodes. The Intel side of this function
 * takes advantage of the host's ability to receive from any node. The
 * transputer side must receive every node's information from nodes zero &
 * eight (eight only becomes involved in the case of the hybrid 4-cube).
 */

#ifdef PROTOTYPE

double **receive_timing_data(int cubesize)

#else

double **receive_timing_data(cubesize)

int  cubesize;

#endif
{
    double  **dt;     /* (double) version of t[][]     */
    int       i,
              j;
    long      tlen;   /* length of one node's data     */
    ticks   **t;      /* raw timing data from nodes    */


    /* Perform allocation for the timing data t[][]. The two-dimensional
     * array is indexed by node number for the rows and by event for the
     * columns. For instance, t[i][j] means the time required for event
     * j at node i. Actually, there is an extra row reserved at the end
     * of t[][] for totals: t[cubesize][j] gives the total time for event
     * j across all nodes.
     */

    if (!(dt = (double **) malloc((cubesize+1) * sizeof(double*)))) {

        printf("receive_timing_data(): Allocation failure for dt[][].\n");
        exit(EXIT_FAILURE);
    }

    for (i = 0; i < (cubesize + 1); i++) {

        if (!(dt[i] = (double *) calloc(MAX_EVENTS, sizeof(double)))) {

            printf("Host: Allocation failure for dt[%d].\n", i);
            exit(EXIT_FAILURE);
        }
    }

    if (!(t = (ticks **) malloc((cubesize+1) * sizeof(ticks*)))) {

        printf("receive_timing_data(): Allocation failure for t[][].\n");
        exit(EXIT_FAILURE);
    }

    for (i = 0; i < (cubesize + 1); i++) {

        if (!(t[i] = (ticks *) calloc(MAX_EVENTS, sizeof(ticks)))) {

            printf("Host: Allocation failure for t[%d].\n", i);
            exit(EXIT_FAILURE);
        }
    }

    printf("Receiving timing data from the nodes");
    tlen = (long) (MAX_EVENTS * sizeof(ticks));

    for (i = 0; i < cubesize; i++) {

        printf(".");

#ifdef TRANSPUTER

        if (i < 8) receive(0, (char *) t[i], tlen, cubesize);
        else       receive(8, (char *) t[i], tlen, cubesize);

#else /* iPSC/2 */

        receive(i, (char *) t[i], tlen, (i + NODE_OFFSET));

#endif /* TRANSPUTER */

    }

    printf("\n\n");

    /* Calculate totals, averages; place totals in t[cubesize] first....
     * then copy to dt[][] and record averages in dt[cubesize].
     */

    for (i = 0; i < cubesize; i++) {

        for (j = 0; j < MAX_EVENTS; j++) t[cubesize][j] += t[i][j];
    }

    /* Fill dt[][] with double values (in seconds). The conversion
     * factors are borrowed from timing.h.
     */

    for (i = 0; i <= cubesize; i++) {

        dt[i][DATA_SOURCE] = (double) t[i][DATA_SOURCE];

        for (j = START_TIME; j < MAX_EVENTS; j++) {

#ifdef TRANSPUTER

            dt[i][j] = ((double) t[i][j]) * LO_PERIOD;

#else

            dt[i][j] = ((double) t[i][j]) * M_PERIOD;

#endif
        }
    }

    /* Convert totals to averages in dt[cubesize] */

    for (j = START_TIME; j < MAX_EVENTS; j++) {

        dt[cubesize][j] /= ((double) cubesize);
    }

    for (i = 0; i < (cubesize + 1); i++) free(t[i]);
    free(t);

    return(dt);
}
/* End receive_timing_data() --------------------------------------------- */
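The totals-then-averages bookkeeping in receive_timing_data() can be sketched on a small fixed-size table. This is an illustration only: the sizes NODES and EVENTS, the function name totals_and_averages(), and the seconds_per_tick parameter are hypothetical stand-ins (the thesis code uses MAX_EVENTS, heap-allocated rows, and the LO_PERIOD or M_PERIOD constants from timing.h), and the sketch treats every column as a timed event rather than reserving column 0 for DATA_SOURCE.

```c
#include <assert.h>

#define NODES   2   /* hypothetical cube size for this sketch  */
#define EVENTS  3   /* hypothetical number of timed events     */

/* Row NODES of t[][] is the extra row that accumulates totals.
 * After conversion to seconds, row NODES of dt[][] holds averages. */
void totals_and_averages(long   t[NODES + 1][EVENTS],
                         double dt[NODES + 1][EVENTS],
                         double seconds_per_tick)
{
    int i, j;

    assert(seconds_per_tick > 0.0);

    /* Sum each event over all nodes into the extra row. */
    for (j = 0; j < EVENTS; j++) t[NODES][j] = 0;

    for (i = 0; i < NODES; i++)
        for (j = 0; j < EVENTS; j++)
            t[NODES][j] += t[i][j];

    /* Convert raw ticks to seconds, totals row included. */
    for (i = 0; i <= NODES; i++)
        for (j = 0; j < EVENTS; j++)
            dt[i][j] = ((double) t[i][j]) * seconds_per_tick;

    /* Turn the totals row into per-node averages. */
    for (j = 0; j < EVENTS; j++)
        dt[NODES][j] /= (double) NODES;
}
```

Keeping the totals in the integer tick table and dividing only after conversion avoids accumulating rounding error across nodes.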


/* ------------============ FUNCTION DEFINITION ============------------
 *
 * This function analyzes the command line that the user supplied and sets
 * variables accordingly. The valid arguments are given by
 * define_valid_args(), and the real work is passed off to
 * interpret_args(), from the clargs library.
 */

#ifdef PROTOTYPE

void resolve_args(int argc, char *argv[],
                  int *dim, int *timing, int *verbose)

#else

void resolve_args(argc, argv, dim, timing, verbose)

int    argc;
char  *argv[];
int   *dim,
      *timing,
      *verbose;

#endif
{
    int  maxdim = 3,
         valid  = FALSE;


    interpret_args(argc, argv, NUMBER_OF_ARGS, optv);   /* see clargs.h */

#ifdef TRANSPUTER

    *dim = DIMENSION;

#else /* iPSC/2 */

    if (optv[DIM]->found) *dim = (int) optv[DIM]->lsa[0];

    switch (*dim) {

        case 0: case 1: case 2: case 3: break;

        default: while (!valid) {

            printf("Enter desired cube dimension (0...%d): ", maxdim);
            scanf("%d", dim);
            fflush(stdin);

            switch(*dim) {

                case 0: case 1: case 2: case 3:
                    valid = TRUE;
                    break;
            }
        }
    } /* end switch() */

#endif /* TRANSPUTER */

    (optv[TIMING]->found)  ? (*timing  = TRUE) : (*timing  = FALSE);
    (optv[VERBOSE]->found) ? (*verbose = TRUE) : (*verbose = FALSE);

    printf("Argument resolution complete...\n\n");
    printf(" Cube Dimension: %d\n", *dim);
    if (*timing) printf(" Timing: ON\n");
    (*verbose) ? (printf(" Verbose Mode: ON\n\n"))
               : (printf("\n"));
}
/* End resolve_args() ---------------------------------------------------- */


/* ------------============ FUNCTION DEFINITION ============------------
 *
 */

#ifdef PROTOTYPE

void show_resulting_matrices(Double_Matrix_Type *A,
                             Double_Matrix_Type *A0, int *q)

#else

void show_resulting_matrices(A, A0, q)

Double_Matrix_Type  *A,
                    *A0;
int                 *q;

#endif
{
    Double_Matrix_Type  *D,
                        *L,
                        *LU,
                        *P,
                        *QT,
                        *QTA,
                        *QTAP,
                        *U;
    int                  i,
                         j,
                         m = A->rows,
                         n = A->cols;


    printf("Gauss Factorization Complete...\n\n");
    strcpy(A->name, "A (after GF operations)");

    /* Allocate and form Q' and P -------------------- */

    if (!(QT = matalloc(m,m))) {

        printf("Allocation failure for QT.\n");
        exit(EXIT_FAILURE);
    }

    strcpy(QT->name, "Q Transpose");

    for (i = 0; i < m; i++) { QT->matrix[i][q[i]] = 1.0; }

    if (!(P = identity(n,n))) {

        printf("Allocation failure for P.\n");
        exit(EXIT_FAILURE);
    }

    strcpy(P->name, "P [ Partial (column) Pivoting ==> P == Identity ]");


    /* Here, we slowly form Q'AP, keeping in mind that the A we are
     * talking about is the original A....and we have labeled that one
     * A0. Therefore, we first form QTA (Q'A) as Q' * A0. After we
     * have QTA, we can multiply it (on the right) by P to get Q'AP,
     * or QTAP as it is called here.
     */

    if (!(QTA = matalloc(m,n))) {

        printf("Allocation failure for QTA.\n");
        exit(EXIT_FAILURE);
    }

    strcpy(QTA->name, "Q' * (original) A");

    if (matrix_product(QT, A0, QTA) == FAILURE) {

        printf("matrix_product(QTA) Failure.\n");
        exit(EXIT_FAILURE);
    }


    if (!(QTAP = matalloc(m,n))) {

        printf("Allocation failure for QTAP.\n");
        exit(EXIT_FAILURE);
    }

    strcpy(QTAP->name, "Q' * A * P");

    if (matrix_product(QTA, P, QTAP) == FAILURE) {

        printf("matrix_product(QTAP) Failure.\n");
        exit(EXIT_FAILURE);
    }


    /* Next, we form L and U so that we can compare Q'AP ?=? LU. */

    L = zeros(m, n); L->name = "L ";
    U = zeros(m, n); U->name = "U ";

    for (i = 0; i < A->rows; i++) {

        for (j = 0; j < A->cols; j++) {

            if (i < j) { U->matrix[i][j] = A->matrix[i][j]; }

            if (i == j) {

                L->matrix[i][j] = 1.0;
                U->matrix[i][j] = A->matrix[i][j];
            }

            if (i > j) { L->matrix[i][j] = A->matrix[i][j]; }
        }
    }

    if (!(LU = matalloc(m,n))) {

        printf("Allocation failure for LU.\n");
        exit(EXIT_FAILURE);
    }

    strcpy(LU->name, "L * U");

    if (matrix_product(L, U, LU) == FAILURE) {

        printf("matrix_product(LU) Failure. \n");
        exit(EXIT_FAILURE);
    }

    /* Finally, we create a matrix of differences between the elements
     * found in QTAP (Q'AP) and LU. If everything proceeded according
     * to the plan, this will be a matrix of zeros.
     */

    if (!(D = matalloc(m,n))) {

        printf("Allocation failure for D.\n");
        exit(EXIT_FAILURE);
    }

    strcpy(D->name, "Q'AP - LU");

    for (i = 0; i < m; i++) {

        for (j = 0; j < n; j++) {

            D->matrix[i][j] = (QTAP->matrix[i][j] - LU->matrix[i][j]);
        }
    }

    printmd(*A, WIDTH, AFT);
    printf("\n\n");
    printmd(*L, WIDTH, AFT);
    printf("\n\n");
    printmd(*U, WIDTH, AFT);
    printf("\n\n");

    printmd(*QT, WIDTH, AFT);
    printf("\n\n");
    printmd(*P, WIDTH, AFT);
    printf("\n\n");
    printmd(*QTA, WIDTH, AFT);
    printf("\n\n");
    printmd(*QTAP, WIDTH, AFT);
    printf("\n\n");
    printmd(*LU, WIDTH, AFT);
    printf("\n\n");
    printmd(*D, WIDTH, AFT);
    printf("\n\n");
}
/* End show_resulting_matrices() ----------------------------------------- */
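The way show_resulting_matrices() unpacks L and U from the single factored matrix can be tried on a tiny square case. The sketch below assumes plain 2 x 2 arrays instead of Double_Matrix_Type; the names split_LU() and multiply() are hypothetical and stand in for the loop above and for matrix_product().

```c
#include <assert.h>

#define N 2   /* hypothetical order, chosen only for this sketch */

/* Unpack the factored A: U is the diagonal and everything above it,
 * L is the strict lower part with ones placed on the diagonal. */
void split_LU(double A[N][N], double L[N][N], double U[N][N])
{
    int i, j;

    for (i = 0; i < N; i++) {
        for (j = 0; j < N; j++) {
            L[i][j] = 0.0;
            U[i][j] = 0.0;
            if (i < j)       { U[i][j] = A[i][j]; }
            else if (i == j) { L[i][j] = 1.0; U[i][j] = A[i][j]; }
            else             { L[i][j] = A[i][j]; }
        }
    }
}

/* Z = X * Y for N x N matrices; Z must not alias X or Y. */
void multiply(double X[N][N], double Y[N][N], double Z[N][N])
{
    int i, j, k;

    assert(Z != X && Z != Y);

    for (i = 0; i < N; i++) {
        for (j = 0; j < N; j++) {
            Z[i][j] = 0.0;
            for (k = 0; k < N; k++) Z[i][j] += X[i][k] * Y[k][j];
        }
    }
}
```

If the factored array holds {{4, 3}, {0.5, 1.5}}, then L = {{1, 0}, {0.5, 1}} and U = {{4, 3}, {0, 1.5}}, and their product reproduces the (row-permuted) original {{4, 3}, {2, 3}}, so the difference matrix D would be zero.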


/* ------------============ FUNCTION DEFINITION ============------------
 *
 * This is a simple function to physically swap the elements from row s to
 * the current pivot row, r. It does not concern itself with column r or
 * any column j > r.
 */

#ifdef PROTOTYPE

void swap_rows_left_of_pivot(Double_Matrix_Type *A, int r, int s)

#else

void swap_rows_left_of_pivot(A, r, s)

Double_Matrix_Type  *A;
int                  r,
                     s;

#endif
{
    double  tmp;
    int     j;


    for (j = 0; j < r; j++) {

        tmp = A->matrix[r][j];
        A->matrix[r][j] = A->matrix[s][j];
        A->matrix[s][j] = tmp;
    }
}
/* End swap_rows_left_of_pivot() ----------------------------------------- */


/* ------------============ FUNCTION DEFINITION ============------------
 *
 * This function performs updates to a permutation vector, v[], of length
 * 'size'. The pivot_index indicates the row or column where the next
 * pivot has been located; and k indicates the stage, or the row and
 * column where the pivot is to be installed.
 */

#ifdef PROTOTYPE

void update_permutation(int v[], int size, int k, int pivot_index)

#else

void update_permutation(v, size, k, pivot_index)

int  v[],
     size,
     k,
     pivot_index;

#endif
{
    int i;


    i = v[k]; v[k] = v[pivot_index]; v[pivot_index] = i;
}
/* End update_permutation() ---------------------------------------------- */
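A short trace shows how a permutation vector evolves as pivots are installed. The sketch assumes an identity starting vector; init_permutation() is a hypothetical stand-in for initial_permutation_vector(), whose source is not shown here, and update_permutation_sketch() repeats the swap above with bounds checks added for illustration.

```c
#include <assert.h>

/* Start from the identity permutation: v[i] = i. */
void init_permutation(int v[], int size)
{
    int i;

    for (i = 0; i < size; i++) v[i] = i;
}

/* Record that the pivot found at position pivot_index is installed
 * at stage k, by swapping the two entries of v[]. */
void update_permutation_sketch(int v[], int size, int k, int pivot_index)
{
    int tmp;

    assert(k >= 0 && k < size);
    assert(pivot_index >= 0 && pivot_index < size);

    tmp = v[k];
    v[k] = v[pivot_index];
    v[pivot_index] = tmp;
}
```

Starting from {0, 1, 2, 3}, installing the pivot from position 2 at stage 0 gives {2, 1, 0, 3}; installing the pivot from position 3 at stage 1 then gives {2, 3, 0, 1}.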


#ifdef PROTOTYPE /* ===================================================== */

main(int argc, char *argv[])

#else

main(argc, argv)

int    argc;
char  *argv[];

#endif
{

    /* ------------=========== VARIABLE DEFINITIONS ===========------------ */

    double   a,              /* denominator of growth factor (g/a)    */
            *cbuf,           /* col buffer holds one col at a time    */
           **dtime,          /* doubles corresponding to ticks **t    */
             eps = epsd(),   /* machine precision (see machine.h)     */
             g = 0.0,        /* the growth factor                     */
             root_time,      /* time measured at root for iterations  */
             tol;            /* tolerance                             */

    Double_Matrix_Type  *A,  /* This A gets operated upon/changed     */
                        *A0; /* The original copy of A                */

    int      cubesize,       /* number of processors in the cube      */
             dim,            /* dimension of the hypercube            */
             i,
             j,
             m,              /* number of rows in A                   */
             me,             /* root processor's id                   */
             n,              /* number of cols in A                   */
            *q,              /* row permutation vector                */
             r,              /* numerical rank estimate               */
             timing,         /* Boolean                               */
             verbose;        /* Boolean                               */

    long     sizeof_col,     /* sizes, in bytes                       */
             sizeof_int,
             sizeof_pivot;

    ticks    root_start,
             t_root,         /* time measured at root transputer      */
           **t;              /* time data: row => node, col => event  */

    Pivot_Type  pivot;       /* pivot                                 */


    /* ------------============== INITIALIZATIONS ==============------------ */
#ifdef TRANSPUTER

    /* Add 1M to the heap to allow for generation of large matrices */
    addfree((void *) _heapend, 0x100000);

#endif

    printf("\n%s \n\n", version);

    define_valid_args();

    resolve_args(argc, argv, &dim, &timing, &verbose);

    A = generate(&m, &n);

    sizeof_col   = (long) (A->rows * sizeof(double));
    sizeof_int   = (long) sizeof(int);
    sizeof_pivot = (long) sizeof(Pivot_Type);

    if (!(cbuf = (double *) malloc(sizeof_col))) {

        printf("main(): Allocation failure for cbuf[].\n");
        exit(EXIT_FAILURE);
    }

    cubesize = POW2(dim);

#ifdef TRANSPUTER

    initialize_hypercube(dim);

#else

    cubename = initialize_hypercube(dim, nodecode);

#endif

    me = myhost();

    if (verbose) {

        if (!(A0 = matalloc(m,n))) {

            printf("Allocation failure for A0.\n");
            exit(EXIT_FAILURE);
        }

        strcpy(A0->name, "Original A");

        for (i = 0; i < A->rows; i++) {
            for (j = 0; j < A->cols; j++) {

                A0->matrix[i][j] = A->matrix[i][j];
            }
        }
        printf("\n\nA has been allocated and generated.\n\n");
        printmd(*A, WIDTH, AFT);
        printf("\n\nSending size(A) to the nodes.\n\n");
    }


#ifdef TRANSPUTER

    cubecast(me, dim, (char *) &m, sizeof_int, cubesize);
    cubecast(me, dim, (char *) &n, sizeof_int, cubesize);
    cubecast(me, dim, (char *) &timing, sizeof_int, cubesize);

#else /* iPSC/2 */

    cubecast(me, dim, (char *) &m, sizeof_int, ROW_SIZE_TYPE);
    cubecast(me, dim, (char *) &n, sizeof_int, COL_SIZE_TYPE);
    cubecast(me, dim, (char *) &timing, sizeof_int, ARG_TYPE);

#endif

    if (verbose) printf("\nSent size(A) to nodes.\n");

    distribute_columns(A, dim, cbuf);

    q = initial_permutation_vector(m);



    /* FINAL PREPARATIONS BEFORE STARTING THE ITERATION -------------------
     *
     * Get the first pivot from node 0. Initialize the growth factor
     * variables, g and a, so that we can compute growth factor (g/a) as
     * we go. Set a reasonable tolerance.
     *
     * --------------------------------------------------------------------
     */

#ifdef TRANSPUTER

    receive(0, (char *) &pivot, sizeof_pivot, cubesize);

#else /* iPSC/2 */

    receive(0, (char *) &pivot, sizeof_pivot, PIVOT_TYPE);

#endif /* TRANSPUTER */


    a = g = MAX(g, fabs(pivot.u));

    tol = (MIN(m,n)) * g * eps;


    /* BEGINNING OF ITERATION ---------------------------------------------
     *
     * We enter with A established and knowledge of the first pivot.
     *
     * --------------------------------------------------------------------
     */

#ifdef TRANSPUTER

    root_start = clock();

#endif

    printf("Beginning iterations.\n\n");


1076 for (r = 0; r < (MIN(m,n)); r++) { 

1077 

1078 if (pivot.id == RANK_DEFICIENT) break; 

1079 

1080 /* We expect to receive cbuf{] in the correct (i.e., already 
1081 * swapped) order. Before we stuff cbuff) into A{J[J], we’ll swap 
1082 * rows left of the pivot column, and then insert the new pivot 
1083 * column. 

1084 */ 

1085 

1086 #ifdef TRANSPUTER 

1087 

1088 receive(O, (char *) cbuf, sizeof_col, cubesize); 

1089 

1090 #else /* iPSC/2 */ 

1091 

1092 receive(0O, (char *) cbuf, sizeof_col, PCOL_TYPE); 

1093 

1094 #endif /* TRANSPUTER */ 

1095 

1096 g = MAX(g, fabs(pivot.u)); 

1097 

1098 update_permutation(q, m, r, pivot.s); 

1099 

1100 if (pivot.s != r) swap_rows_left_of_pivot(A, r, pivot.s); 
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1101 

        for (i = 0; i < A->rows; i++) { A->matrix[i][r] = cbuf[i]; }

        if (verbose) {

            printf("Host:  Stage %d, Pivot value = %e.  ", r, pivot.u);
            printf("Growth factor = %e.\n", (g/a));

            printf("q = "); printvi(q, A->rows, WIDTH);

            printf("\n");
        }

        if (r < ((MIN(m,n)) - 1)) {

#ifdef TRANSPUTER

            receive(0, (char *) &pivot, sizeof_pivot, cubesize);

#else /* iPSC/2 */

            receive(0, (char *) &pivot, sizeof_pivot, PIVOT_TYPE);

#endif /* TRANSPUTER */

        }

    } /* end for(r) ------------------------------------------------------ */

#ifdef TRANSPUTER

    t_root = (clock() - root_start);

    if (timing) {

        root_time = ((double) t_root) * LO_PERIOD;

        printf("\n\nRoot transputer:  ");
        printf("Time for iterations:  %8.4lf seconds\n\n", root_time);
    }

#endif

    free(cbuf);

    /* I have selected the easy way out and assumed A has full rank.  If
     * you did not make this assumption, you would need to collect the
     * remaining columns at this point.
     */



    if (timing) dtime = receive_timing_data(cubesize);

    /* There is no more use for the nodes, so they can be released. */

#ifndef TRANSPUTER
    printf("\n\nmain(): Killing and releasing cube.\n\n");
    killcube(ALL_NODES, ALL_PIDS);
    relcube(cubename);
#endif

    if (verbose) {  /* Create and show Q', AO, P, L, U.... ----------- */

        show_resulting_matrices(A, AO, q);

    }

    if (timing) display_timing_data(A, dim, a, eps, g, tol, r, dtime);

}

/* ------------============= EOF gfpphost.c =============------------- */



/* ------------============= PROGRAM INFORMATION =============------------
 *
 * SOURCE  : gfppnode.c
 * VERSION : 2.0
 * DATE    : 21 September 1991
 * AUTHOR  : Jonathan E. Hartman, U. S. Naval Postgraduate School
 * REMARKS : See gf.h.
 *
 * -----------------------------------------------------------------------
 */

#include <math.h>

#ifdef TRANSPUTER

#include <conc.h>

#include <matrix.h>
#include <macros.h>
#include <allocate.h>
#include <comm.h>
#include <generate.h>
#include <mathx.h>
#include <ops.h>
#include <timing.h>

#else

#include "/usr/hartman/matlib/matrix.h"
#include "/usr/hartman/matlib/macros.h"
#include "/usr/hartman/matlib/allocate.h"
#include "/usr/hartman/matlib/comm.h"
#include "/usr/hartman/matlib/generate.h"
#include "/usr/hartman/matlib/mathx.h"
#include "/usr/hartman/matlib/ops.h"
#include "/usr/hartman/matlib/timing.h"

#endif

#include "gf.h"

#ifdef TRANSPUTER

Channel *ic[(CUBESIZE + 1)],
        *oc[(CUBESIZE + 1)];

#endif

ticks t[MAX_EVENTS];



/* ------------============= FUNCTION DEFINITION =============------------
 *
 * This function is kind of an inverse for local_column().  Given some
 * column number (local_column) held at this node, the function returns
 * the corresponding column number in the global/host copy of the
 * full-sized A.  This could be implemented more efficiently as a macro.
 */

#ifdef PROTOTYPE
int global_column(int local_column, int me, int cubesize)
#else
int global_column(local_column, me, cubesize)
int local_column,
    me,
    cubesize;
#endif
{
    return(local_column * cubesize + me);
}
/* End global_column() --------------------------------------------------- */


/* ------------============= FUNCTION DEFINITION =============------------
 *
 * This function maps a column number in the global A (the full-sized A
 * held at the root processor/host) to the corresponding local column
 * number.  If the global_column is not one that is held at this node, a
 * negative value (-1) is returned.
 */

#ifdef PROTOTYPE
int local_column(int global_column, int me, int cubesize)
#else
int local_column(global_column, me, cubesize)
int global_column,
    me,
    cubesize;
#endif
{
    if ((global_column % cubesize) != me) return(-1);

    return((int) global_column / cubesize);
}
/* End local_column() ---------------------------------------------------- */
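
The two mapping functions above describe a round-robin (column-cyclic) layout: global column gc lives at node gc % cubesize. The following is a minimal sketch of that layout in ANSI C, kept separate from the thesis code. The names to_global(), to_local(), and columns_held() are hypothetical and chosen only for this illustration; columns_held() is an assumed closed form for the count that main() accumulates in my_cols with a loop.

```c
#include <assert.h>

/* Global index of local column lc at node me (same arithmetic as
 * global_column() in the listing). */
static int to_global(int lc, int me, int cubesize)
{
    assert(cubesize > 0 && me >= 0 && me < cubesize && lc >= 0);
    return lc * cubesize + me;
}

/* Local index of global column gc at node me, or -1 if node me does not
 * hold that column (same arithmetic as local_column() in the listing). */
static int to_local(int gc, int me, int cubesize)
{
    assert(cubesize > 0 && me >= 0 && me < cubesize && gc >= 0);
    if ((gc % cubesize) != me) return -1;
    return gc / cubesize;
}

/* Hypothetical helper: how many of the n global columns land at node me.
 * This counts the values me, me + cubesize, me + 2*cubesize, ... below n. */
static int columns_held(int n, int me, int cubesize)
{
    assert(cubesize > 0 && me >= 0 && me < cubesize && n >= 0);
    return (n > me) ? (n - me + cubesize - 1) / cubesize : 0;
}
```

With n = 10 and cubesize = 4, for example, node 1 would hold global columns 1, 5, and 9 as local columns 0, 1, and 2, and to_local(9, 0, 4) reports -1 because node 0 does not hold column 9.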

/* ------------============= FUNCTION DEFINITION =============------------
 *
 */

#ifdef PROTOTYPE

void do_pivot_column_arithmetic(Double_Matrix_Type *A, double *cbuf,
                                int k, int me, int cubesize)

#else

void do_pivot_column_arithmetic(A, cbuf, k, me, cubesize)

Double_Matrix_Type *A;
double             *cbuf;
int                 k,
                    me,
                    cubesize;

#endif
{
    double pivot_value;

    int    i,
           pivot_column;

    pivot_column = local_column(k, me, cubesize);
    pivot_value  = A->matrix[k][pivot_column];

    /* Divide everything under the pivot by the pivot value */
    for (i = (k+1); i < A->rows; i++) {

        A->matrix[i][pivot_column] /= pivot_value;
    }

    /* This is somewhat redundant, and not optimal with respect to
     * efficiency, but it works and reads clearly, right?
     */
    for (i = 0; i < A->rows; i++) cbuf[i] = A->matrix[i][pivot_column];
}
/* End do_pivot_column_arithmetic() -------------------------------------- */

/* ------------============= FUNCTION DEFINITION =============------------
 *
 * This function accepts the matrix, the global column number for this
 * stage (where the pivot will be taken from), and a pivot structure to be
 * filled....among other things....and 'returns' the row, s, and value, u,
 * of the new pivot in global column r (local column lc).
 */

#ifdef PROTOTYPE
void locate_pivot(int me, int cubesize, Double_Matrix_Type *A, int r,
                  Pivot_Type *pivot)
#else
void locate_pivot(me, cubesize, A, r, pivot)
int                 me,
                    cubesize;
Double_Matrix_Type *A;
int                 r;
Pivot_Type         *pivot;
#endif
{
    int i,
        pivot_column;

    pivot_column = local_column(r, me, cubesize);

    /* Initialize pivot row and value */
    pivot->s = r;
    pivot->u = A->matrix[r][pivot_column];

    for (i = (r+1); i < A->rows; i++) {

        if (fabs(A->matrix[i][pivot_column]) > fabs(pivot->u)) {

            pivot->s = i;
            pivot->u = A->matrix[i][pivot_column];
        }
    }
}
/* End locate_pivot() ---------------------------------------------------- */



/* ------------============= FUNCTION DEFINITION =============------------
 *
 * Receive this node's columns from the root/host processor (manager),
 * place them into the column buffer, then transfer them into A while
 * the other processors are communicating with the root.
 *
 * The transputer scheme is a bit more involved.  Here nodes 0000 and 1000
 * are connected to the root and they must receive for everyone.  They (0
 * and 8) are not directly connected to everyone, so the columns must be
 * passed out in cycles.  For instance, suppose we used the hybrid 4-cube.
 * Then nodes 0 and 8 would receive bursts of 8 columns at a time.  They
 * would keep the first one (we'll call it column 0 in some sort of
 * relative numbering scheme that abides by the C numbering convention),
 * send the next one (col 1) in the 0x1 direction, the next to the 0x2
 * direction, column 3 in the 0x1 direction, column 4 in the 0x4 direction,
 * column 5 in the 0x1 direction, column 6 in the 0x2 direction, and
 * lastly, column 7 in the 0x1 direction.  This makes cycle == 8 for nodes
 * 0000 and 1000.  Similarly, nodes x001 have a cycle of four where they
 * keep the first column to arrive and then send the next three to
 * directions 0x2, 0x4, and 0x2 in turn.  This distribution pattern is
 * maintained until all of the columns have been distributed.
 */

#ifdef PROTOTYPE

void receive_columns(int                 dim,
                     int                 node,
                     Double_Matrix_Type *A,
                     int                 n,
                     double             *cbuf,
                     int                 my_cols,
                     int                 colsize)

#else

void receive_columns(dim, node, A, n, cbuf, my_cols, colsize)

int                 dim,
                    node;
Double_Matrix_Type *A;
int                 n;
double             *cbuf;
int                 my_cols,
                    colsize;

#endif
{
    int cubesize = pow2(dim),
        cycle,                  /* length of typical col burst   */
        dimeff = MIN(3, dim),   /* effective dimension           */
        from,                   /* node that I receive from      */
        gc,                     /* global column index           */
        i,
        idx,                    /* index into to[]               */
        lc = 0,                 /* local column index            */
        ldeff,                  /* effective least_dimension()   */
        nodeff = (node % 8),    /* effective node number         */
        others,                 /* no. of nodes in other 3-cube  */
        step,                   /* for destination of cols rec'd */
        thehost = myhost(),
        to[8];                  /* ==> direction to send to      */

#ifdef TRANSPUTER

    ldeff = least_dimension(nodeff);

    if (nodeff == 0) from = myhost();
    else             from = node - pow2(ldeff - 1);

    /* cycle describes the length of a cycle that starts with me (node)...
     * then I receive several columns for others....then start over with
     * me.  The nodes in the highest dimension have cycle == 1 ==> self
     * only.  We also fill to[] with the directions that we will be
     * sending to within a given cycle.  Not all nodes use all 8 elements
     * of to[].  They only use the first cycle elements.  The step is the
     * difference between the column numbers received at this node during
     * a given burst of length cycle.
     *
     * When we use the hybrid 4-cube, we are treating it as two 3-cubes,
     * so the variable others is set to 8.  This is because there are 8
     * other columns between every burst that comes to the 3-cube that
     * node is in.
     */
    cycle = pow2(dimeff - ldeff);

    (dim == 4) ? (others = 8) : (others = 0);

    step = pow2(ldeff);

    to[0] = 0;
    to[1] = to[3] = to[5] = to[7] = pow2(ldeff);
    to[2] = to[6] = pow2(ldeff + 1);
    to[4] = pow2(ldeff + 2);

    for (gc = node; gc < n; gc += (others + step)) {

        receive(from, (char *) cbuf, colsize, cubesize);

        for (i = 0; i < A->rows; i++) A->matrix[i][lc] = cbuf[i];

        lc++;

        for (idx = 1; idx < cycle; idx++) {

            gc += step;

            if (gc < n) {

                receive(from, (char *) cbuf, colsize, cubesize);

                directional_send(node, dim, to[idx], (char *) cbuf, colsize);
            }
        }

    } /* end for(gc) */

#else /* iPSC/2 */

    for (lc = 0; lc < my_cols; lc++) {

        receive(thehost, (char *) cbuf, colsize, COL_TYPE);

        for (i = 0; i < A->rows; i++) { A->matrix[i][lc] = cbuf[i]; }
    }

#endif /* TRANSPUTER */

}
/* End receive_columns() ------------------------------------------------- */
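
The to[] table that receive_columns() fills by hand follows a regular pattern: the forwarding direction for position idx within a burst is the lowest set bit of idx, scaled by pow2(ldeff). The sketch below illustrates that observation only. The function burst_direction() is hypothetical and is not part of the thesis library, and ldeff is assumed to carry the same meaning as in the listing (0 for nodes 0000 and 1000, 1 for nodes x001).

```c
#include <assert.h>

/* Direction mask used to forward the column at position idx of a burst.
 * Position 0 is the column the node keeps for itself, reported as 0,
 * which matches to[0] = 0 in the listing. */
static int burst_direction(int idx, int ldeff)
{
    assert(idx >= 0 && idx < 8 && ldeff >= 0);
    if (idx == 0) return 0;
    /* (idx & -idx) isolates the lowest set bit: 1,2,1,4,1,2,1 for 1..7. */
    return (idx & -idx) << ldeff;
}
```

For ldeff = 0 this gives the sequence 0x1, 0x2, 0x1, 0x4, 0x1, 0x2, 0x1 described for nodes 0000 and 1000, and for ldeff = 1 the first three forwarding directions are 0x2, 0x4, 0x2, as described for nodes x001.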



/* ------------============= FUNCTION DEFINITION =============------------
 *
 * This function sends in the timing data that is held in t[].
 */

#ifdef PROTOTYPE

void submit_timing_data(int node, int dim)

#else

void submit_timing_data(node, dim)

int node,
    dim;

#endif
{
    int dimeff = MIN(dim, 3),
        dir,
        i,
        ld     = least_dimension(node % 8),
        nodeff = (node % 8),
        root   = myhost();

    long cubesize = pow2(dim),
         tlen;

    tlen = (long) (MAX_EVENTS * sizeof(ticks));

#ifdef TRANSPUTER

    submit(node, dim, (char *) t, tlen, cubesize);

    if (dimeff == ld) return;

    if ((nodeff == 2) || (nodeff == 3)) {

        if (dimeff > 2) {
            directional_receive(node, dim, 0x4, (char *) t, tlen);
            submit(node, dim, (char *) t, tlen, cubesize);
        }
        return;
    }

    if (nodeff == 1) {

        if (dimeff > 1) {

            directional_receive(node, dim, 0x2, (char *) t, tlen);
            submit(node, dim, (char *) t, tlen, cubesize);
        }

        if (dimeff > 2) {

            directional_receive(node, dim, 0x4, (char *) t, tlen);
            submit(node, dim, (char *) t, tlen, cubesize);
            directional_receive(node, dim, 0x2, (char *) t, tlen);
            submit(node, dim, (char *) t, tlen, cubesize);
        }

        return;
    }

    if (nodeff == 0) {

        if (dimeff > 0) {

            /* retrans from 1 or 9 --------------------------------------- */
            directional_receive(node, dim, 0x1, (char *) t, tlen);
            submit(node, dim, (char *) t, tlen, cubesize);
        }

        if (dimeff > 1) {

            /* retrans from 2 or 10 -------------------------------------- */
            directional_receive(node, dim, 0x2, (char *) t, tlen);
            submit(node, dim, (char *) t, tlen, cubesize);
            /* retrans from 3 or 11 -------------------------------------- */
            directional_receive(node, dim, 0x1, (char *) t, tlen);
            submit(node, dim, (char *) t, tlen, cubesize);
        }

        if (dimeff > 2) {

            /* retrans from 4 or 12 -------------------------------------- */
            directional_receive(node, dim, 0x4, (char *) t, tlen);
            submit(node, dim, (char *) t, tlen, cubesize);
            /* retrans from 5 or 13 -------------------------------------- */
            directional_receive(node, dim, 0x1, (char *) t, tlen);
            submit(node, dim, (char *) t, tlen, cubesize);
            /* retrans from 6 or 14 -------------------------------------- */
            directional_receive(node, dim, 0x2, (char *) t, tlen);
            submit(node, dim, (char *) t, tlen, cubesize);
            /* retrans from 7 or 15 -------------------------------------- */
            directional_receive(node, dim, 0x1, (char *) t, tlen);
            submit(node, dim, (char *) t, tlen, cubesize);
        }
    }

#else /* iPSC/2 */

    delay(1.0 + 2.0 * (float) node);

    send(root, (char *) t, tlen, (node + NODE_OFFSET));

#endif /* TRANSPUTER */

}
/* End submit_timing_data() ---------------------------------------------- */


/* ------------============= FUNCTION DEFINITION =============------------
 *
 * This function performs the required operations on the Gauss Transform
 * area, G, of A and searches for the next pivot.
 */

#ifdef PROTOTYPE

void update_G(Double_Matrix_Type *A, double *cbuf,
              int cubesize, int k, int me, int n, Pivot_Type *pivot)

#else

void update_G(A, cbuf, cubesize, k, me, n, pivot)

Double_Matrix_Type *A;
double             *cbuf;
int                 cubesize,
                    k,
                    me,
                    n;
Pivot_Type         *pivot;

#endif
{
    int i,
        j,
        gc,         /* global column number         */
        lc = 0;     /* local column number to start */

    ticks start;

    while ((gc = global_column(lc, me, cubesize)) <= k) lc++;

    /* The pivot row is k and we know that lc is the first local column to
     * the right of k.  Now we must move through the Gauss Transform area,
     * all A(i,j) where i > k and j > k, and perform the operation:
     *
     *   A(i,j) = A(i,j) - A(i,k) * A(k,j)  <==>  a(i,j) -= cbuf[i]*a(k,j)
     *
     */

    start = clock();

    for (i = k+1; i < A->rows; i++) {

        for (j = lc; j < A->cols; j++) {

            A->matrix[i][j] -= (cbuf[i] * A->matrix[k][j]);

        } /* end for(j) */

    } /* end for(i) */

    t[LOOPTIME] += (clock() - start);

}
/* End update_G() -------------------------------------------------------- */


main(){

    double *cbuf;           /* column buffer holds one col of A     */

    Double_Matrix_Type *A;  /* this node's portion of the matrix A  */

    int cubesize,           /* number of processors in the cube     */
        dim,                /* dimension of the hypercube           */
        gc,                 /* global column number                 */
        i,                  /* generic integer and row ctr          */
        j,                  /* generic integer and col ctr          */
        k,                  /* index to pivot                       */
        m,                  /* number of rows in A (same local/all) */
        me,                 /* id of this processor                 */
        my_cols = 0,        /* number of cols in local portion of A */
        n,                  /* number of cols in all of A           */
        root,               /* host/root processor id               */
        timing;             /* Boolean                              */

    long sizeof_col,        /* sizes, in bytes                      */
         sizeof_int,
         sizeof_pivot;

    ticks start,
          starti;           /* another start                        */

    Pivot_Type pivot;


566 

    /* ------------======== INITIALIZATION WORK ========------------ */

    for (i = 0; i < MAX_EVENTS; i++) t[i] = 0;

    start = t[START_TIME] = clock();

#ifdef TRANSPUTER

    cubesize = CUBESIZE;
    dim      = DIMENSION;
    initialize_hypercube(dim);

#else

    cubesize = (int) numnodes();
    dim      = (int) nodedim();

#endif

    t[DATA_SOURCE] = me = (int) mynode();
    root           = (int) myhost();

    sizeof_int     = (long) sizeof(int);
    sizeof_pivot   = (long) sizeof(Pivot_Type);


    /* BROADCAST THE SIZE(A) ---------------------------------------------
     *
     * All node processors need to know the number of rows and columns in
     * the matrix A [i.e., size(A)].  A broadcast to the entire cube,
     * cubecast(), is used to achieve this.  The nodes also need to know
     * whether or not to set timing on, so this value is passed too.
     *
     */

#ifdef TRANSPUTER

    cubecast(me, dim, (char *) &m, sizeof_int, cubesize);
    cubecast(me, dim, (char *) &n, sizeof_int, cubesize);
    cubecast(me, dim, (char *) &timing, sizeof_int, cubesize);

#else /* iPSC/2 */

    cubecast(me, dim, (char *) &m, sizeof_int, ROW_SIZE_TYPE);
    cubecast(me, dim, (char *) &n, sizeof_int, COL_SIZE_TYPE);
    cubecast(me, dim, (char *) &timing, sizeof_int, ARG_TYPE);

#endif /* TRANSPUTER */

    sizeof_col = (long) (m * sizeof(double));

618 

619 

    /* COLUMN BUFFER AND COUNTER -----------------------------------------
     *
     * The column buffer, cbuf[], will be used to hold one column of A at
     * a time.  We will see cbuf[] used on a variety of occasions when we
     * must work with a column of A.  Allocate cbuf[] and determine the
     * number of columns that will be stored locally (my_cols).
     *
     */
    cbuf = (double *) malloc(sizeof_col);

    for (i = 0; i < n; i++) { if ((i % cubesize) == me) my_cols++; }


    /* ESTABLISH LOCAL A -------------------------------------------------
     *
     * Allocate storage space for this node's part of A (it is called A
     * even though it is only part of A).
     */

    A = matalloc(m, my_cols);

    t[SETUP] = clock() - start;

    start = clock();

    receive_columns(dim, me, A, n, cbuf, my_cols, sizeof_col);

    t[DISTRIB_COLS] = clock() - start;

648 

649 

    /* BEGIN ITERATION ---------------------------------------------------
     *
     * 1.) At the top of the for() loop we have just completed update_G(),
     *     so the local candidate for the next pivot is situated in np[0].
     *     The function elect_next_pivot() performs a series of
     *     directional_exchange()s so that all local candidates compete in
     *     an election process.  The winner is np[0].
     *
     * 2.) If all went well, np[0] contains the next pivot.  This informa-
     *
     * 3.) If this node has the pivot column [if (p[k] == gc)], it must
     *     divide everything under the pivot by the value of the pivot and
     *     distribute the column to all other nodes (node zero sends to
     *     host).
     *
     * 4.) Finally, this node must perform the computations across the
     *     Gauss Transform area for the local portion of A.  The
     *     update_G() function also locates the next pivot without special
     *     expense.  Then it is time to go back to the top of the loop.
     */

669 

    start = clock();

    for (k = 0; k < (MIN(m,n)); k++) {

        pivot.id = k % cubesize;
        pivot.t  = k;

        /* know id; k ==> t; need s, u */

        if (pivot.id == me) locate_pivot(me, cubesize, A, k, &pivot);

        cubecast_from(pivot.id, me, dim, (char *) &pivot, sizeof_pivot);

        if (me == 0) {

            starti = clock();

#ifdef TRANSPUTER

            send(root, (char *) &pivot, sizeof_pivot, cubesize);

#else /* iPSC/2 */

            send(root, (char *) &pivot, sizeof_pivot, PIVOT_TYPE);

#endif /* TRANSPUTER */

            t[PIVOTS_TO_HOST] += (clock() - starti);
        }

        swap_rows(A, k, pivot.s);

        starti = clock();

        if (pivot.id == me) {

            do_pivot_column_arithmetic(A, cbuf, k, me, cubesize);
        }

        t[PCOL_ARITHMETIC] += (clock() - starti);

        starti = clock();

        cubecast_from(pivot.id, me, dim, (char *) cbuf, sizeof_col);

        t[PCOL_DISTRIB] += (clock() - starti);

        if (me == 0) {

            starti = clock();

#ifdef TRANSPUTER

            submit(me, dim, (char *) cbuf, sizeof_col, cubesize);

#else /* iPSC/2 */

            submit(me, dim, (char *) cbuf, sizeof_col, PCOL_TYPE);

#endif /* TRANSPUTER */

            t[PCOLS_TO_HOST] += (clock() - starti);
        }

        starti = clock();
        update_G(A, cbuf, cubesize, k, me, n, &pivot);
        t[UPDATING_G] += (clock() - starti);

    }
    /* END ITERATION [for(k...)] ----------------------------------------- */

    t[ITERATION] = clock() - start;

    free(cbuf);
    t[STOP] = clock();

    if (timing) submit_timing_data(me, dim);

    return(SUCCESS);
}
/* ------------============= EOF gfppnode.c =============------------- */
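
The node loop above splits one elimination stage into a pivot search, a row swap, the pivot-column division, and the update of the Gauss Transform area. As a rough serial reference for checking small cases, the same four steps can be sketched on a single processor as follows. This is an illustration under stated assumptions, not the thesis code: the fixed size N, the plain two-dimensional array, and the name lu_partial_pivot() are all hypothetical, and no message passing or column distribution is modeled.

```c
#include <assert.h>
#include <math.h>

#define N 3   /* illustrative fixed order; the thesis code is general */

/* In-place Gauss factorization with partial (column) pivoting.
 * On return a[][] holds the multipliers below the diagonal and U on and
 * above it; perm[i] is the original index of the row now in position i.
 * Returns 0 on success, -1 if an exactly zero pivot is met. */
static int lu_partial_pivot(double a[N][N], int perm[N])
{
    int i, j, k, s, itmp;
    double tmp;

    for (i = 0; i < N; i++) perm[i] = i;

    for (k = 0; k < N; k++) {
        /* Step 1: locate the pivot (compare locate_pivot()). */
        s = k;
        for (i = k + 1; i < N; i++)
            if (fabs(a[i][k]) > fabs(a[s][k])) s = i;
        if (a[s][k] == 0.0) return -1;

        /* Step 2: swap rows k and s (compare swap_rows()). */
        if (s != k) {
            for (j = 0; j < N; j++) {
                tmp = a[k][j]; a[k][j] = a[s][j]; a[s][j] = tmp;
            }
            itmp = perm[k]; perm[k] = perm[s]; perm[s] = itmp;
        }

        /* Step 3: divide everything under the pivot by the pivot value
         * (compare do_pivot_column_arithmetic()). */
        for (i = k + 1; i < N; i++) a[i][k] /= a[k][k];

        /* Step 4: a(i,j) -= a(i,k) * a(k,j) for i > k, j > k
         * (compare update_G()). */
        for (i = k + 1; i < N; i++)
            for (j = k + 1; j < N; j++)
                a[i][j] -= a[i][k] * a[k][j];
    }
    return 0;
}
```

In the parallel version each node would apply step 4 only to its own columns to the right of k, using the broadcast pivot column in cbuf[] in place of a[i][k].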


gfpcnode.c 


/* ------------============= PROGRAM INFORMATION =============------------
 *
 * SOURCE  : gfpcnode.c
 * VERSION : 2.3
 * DATE    : 17 September 1991
 * AUTHOR  : Jonathan E. Hartman, U. S. Naval Postgraduate School
 * REMARKS : See gf.h.
 *
 * -----------------------------------------------------------------------
 */

#include <math.h>

#ifdef TRANSPUTER

#include <conc.h>

#include <matrix.h>
#include <macros.h>
#include <allocate.h>
#include <comm.h>
#include <generate.h>
#include <mathx.h>
#include <ops.h>
#include <timing.h>

#else

#include "/usr/hartman/matlib/matrix.h"
#include "/usr/hartman/matlib/macros.h"
#include "/usr/hartman/matlib/allocate.h"
#include "/usr/hartman/matlib/comm.h"
#include "/usr/hartman/matlib/generate.h"
#include "/usr/hartman/matlib/mathx.h"
#include "/usr/hartman/matlib/ops.h"
#include "/usr/hartman/matlib/timing.h"

#endif

#include "gf.h"

#ifdef TRANSPUTER

Channel *ic[(CUBESIZE + 1)],
        *oc[(CUBESIZE + 1)];

#endif

ticks t[MAX_EVENTS];



/* ------------============= FUNCTION DEFINITION =============------------
 *
 * After this node finds its candidate for next pivot, there must be a
 * comparison with all other nodes.  The local candidate starts in np[0].
 * Direction-by-direction, candidates are exchanged and the winner is
 * positioned in np[0].  If there is a tie, the candidate from the smaller
 * node number wins.  A RANK_DEFICIENT opponent is ignored (the local
 * candidate must be at least as good).  In the end, all processors have
 * identical entries in np[0].
 */

#ifdef PROTOTYPE

void elect_next_pivot(int me, int dim, Pivot_Type *np)

#else

void elect_next_pivot(me, dim, np)
int         me,
            dim;
Pivot_Type *np;

#endif
{
    int dir;

    long cubesize = pow2(dim),
         len      = sizeof(Pivot_Type);

    for (dir = 1; dir < (int) cubesize; dir <<= 1) {

        if (dir != 8) {
            directional_exchange(me, dim, dir, (char *) &(np[1]),
                                 (char *) &(np[0]), len);
        }
        else {

            if ((me % 8) != 0) {  /* we don't want 0 <--> 8 comm */

                directional_exchange(me, dim, dir, (char *) &(np[1]),
                                     (char *) &(np[0]), len);
            }
        }

        if (np[1].id != RANK_DEFICIENT) {

            if (fabs(np[1].u) > fabs(np[0].u)) {

                np[0].id = np[1].id;    np[0].u = np[1].u;
                np[0].s  = np[1].s;     np[0].t = np[1].t;
            }
            else {

                if (fabs(np[1].u) == fabs(np[0].u)) {

                    if (np[1].id < np[0].id) {  /* smallest breaks tie */

                        np[0].id = np[1].id;    np[0].u = np[1].u;
                        np[0].s  = np[1].s;     np[0].t = np[1].t;
                    }
                }
            }

        } /* end if (np[1].id ...) */

    } /* end for(dir) */


    /* Since there is no direct connection between nodes 0 and 8, we once
     * again destroy the beauty and generality of the hypercube so that we
     * can be sure that 0 and 8 have the best candidate for pivot.
     */

    if (dim == 4) {

        if ((me % 8) == 0) {  /* Nodes 0000 and 1000 */

            directional_receive(me, dim, 0x1, (char *) np, len);
        }

        if ((me % 8) == 1) {  /* Nodes 0001 and 1001 */

            directional_send(me, dim, 0x1, (char *) np, len);
        }
    }
}
/* End elect_next_pivot() ------------------------------------------------ */

/* This is only the first part of this file.  The rest would be similar to
 * gfppnode.c
 *
 * ------------============= EOF gfpcnode.c =============------------- */
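
The comparison applied after each exchange in elect_next_pivot() can be isolated as a small pure function, which makes the tie-breaking rule easier to check by hand. The sketch below is illustrative only. The Candidate structure is an assumed stand-in for Pivot_Type (whose real definition lives in gf.h and is not shown here), the value chosen for RANK_DEFICIENT is a guess at a sentinel, and better_candidate() is a hypothetical name.

```c
#include <assert.h>
#include <math.h>

#define RANK_DEFICIENT (-1)   /* assumed sentinel; see gf.h for the real one */

/* Assumed stand-in for Pivot_Type: owner id, pivot row s, stage t,
 * and pivot value u. */
typedef struct {
    int    id, s, t;
    double u;
} Candidate;

/* Return the winner between the local candidate and a received one.
 * A RANK_DEFICIENT opponent is ignored, the larger magnitude wins, and
 * on an exact tie the smaller node id wins. */
static Candidate better_candidate(Candidate mine, Candidate theirs)
{
    if (theirs.id == RANK_DEFICIENT) return mine;
    if (fabs(theirs.u) > fabs(mine.u)) return theirs;
    if (fabs(theirs.u) == fabs(mine.u) && theirs.id < mine.id) return theirs;
    return mine;
}
```

Because every pair of neighbors applies the same deterministic rule to the same two candidates, both sides of an exchange would end up holding the same winner, which is what lets all processors finish with identical entries.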
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